
AI literacy week: teaching accountants to distrust the agent

A senior partner at a 25-person accounting firm almost emailed a client a VAT answer that was confidently wrong. The agent had hallucinated a rate. We ran a week.

Jacob Molkenboer · Founder · A Brand New Company · 1 May 2024 · 8 min


The almost-sent email

On a Tuesday in March, a senior partner at the firm forwarded us a draft email he was about to send to a logistics client. The client had asked about VAT on a triangular EU transaction. The agent had answered with a 21% rate, a clean three-paragraph explanation, and a confident sentence about reverse charge that was wrong. The partner caught the error because the client's invoice from the previous quarter still sat open on his second monitor and the numbers did not line up.

It was the third near-miss in six weeks. The firm had rolled out an internal agent for tax research in January. By March, the partners had quietly started leaning on it for client-facing work. The managing partner did not call us to remove the agent. He called us to teach the team how to stop trusting it where it should not be trusted.

Why trust calibration beats training

Most AI training inside professional-services firms reads like a vendor demo. It teaches people how to prompt, how to summarise, how to draft. It does not teach them where the agent fails and how to feel that failure coming.

The team had already sat through that kind of training. They knew how to ask the agent for a memo. What they did not know was how to read the difference between a confident answer that was right and a confident answer that was wrong. The two look identical on the screen.

This is the same pattern that faculty who teach with AI tools have started flagging in their cohorts. Students end up fluent, confident, and wrong, with no internal alarm bell to tell them something is off. The skill the tools do not teach you is the one you most need: calibrating how much weight to put on any single answer.

Takeaway

Prompting is the easy half of AI literacy. The hard half is knowing when the answer in front of you is the kind your firm cannot afford to be wrong about.

Who was in the room

We insisted on three things before the week started. Every partner attended every session, including the one who said he did not need it. The agent vendor was not invited, because the vendor's job is to demonstrate capability and ours that week was to demonstrate fragility. And the firm's IT lead sat in the room rather than on call, because most of the policy decisions we ended up making were about access and routing, not about prompts.

Twenty-two of the twenty-five staff joined the live sessions. The other three were on client work that could not move. We recorded the sessions and made the recordings watchable at 1.5x, with a five-question check at the end so anyone catching up had to engage with the material rather than play it in a background tab.

Day 1: Showing the cracks

We started the week with a confidence audit. Every team member brought three real questions they had asked the agent in the previous month. We ran them again in front of the room. For each answer we asked two things in sequence: how confident does this read, and how confident are we?

On twelve of the eighteen questions we ran that morning, the agent's answer read as more confident than the team actually felt about it. On three, the answer was wrong and nobody had caught it at the time. One of the three was the kleineondernemersregeling (small-business scheme) threshold, which the agent had carried forward from a 2023 source even though the rule had shifted. Another was a reduced-rate question on e-books versus print, where the agent had given a clean answer that ignored the format-neutral rule that came in years ago.
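The audit reduces to two scores per answer plus a note on whether the answer later turned out to be wrong. A minimal sketch in Python of how a team might tally that, assuming a 1-to-5 scale; the scale, the field names, and the sample rows are illustrative assumptions, not a record of the session.

```python
from dataclasses import dataclass


@dataclass
class AuditRow:
    question: str
    reads_as: int     # how confident the answer reads, 1 (hedged) to 5 (certain)
    team_feels: int   # how confident the room is in the answer, same scale
    verified_wrong: bool = False


def summarise(rows: list[AuditRow]) -> dict[str, int]:
    """Count answers that read more confident than the team felt, and known errors."""
    return {
        "total": len(rows),
        "overconfident": sum(1 for r in rows if r.reads_as > r.team_feels),
        "wrong": sum(1 for r in rows if r.verified_wrong),
    }


# Made-up rows for illustration only; not data from the firm.
rows = [
    AuditRow("Small-business scheme threshold", reads_as=5, team_feels=2, verified_wrong=True),
    AuditRow("Filing deadline for a quarterly return", reads_as=4, team_feels=4),
    AuditRow("Reduced rate on e-books", reads_as=5, team_feels=3, verified_wrong=True),
]
print(summarise(rows))
```

The useful number is the gap between the two scores, not either score on its own.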

The room got quiet. Not because the agent was bad, but because the answers had already been used.

Day 2: A shared vocabulary for confidence

By the end of the second day we had a working language. The team agreed on three categories for any agent output that touched a rate, a threshold, or a deadline:

Green: a restated rule from the firm's own internal knowledge base, with a citation that resolves to a current Belastingdienst page or a published tax law article. Ships as drafted.
Yellow: an answer that sounds right but cites nothing checkable, or cites a blog post, or cites a source older than 18 months. A human re-derives the rate or rule against the Belastingdienst page before it leaves the firm.
Red: anything involving cross-border VAT, recent rate changes, or a scheme threshold near a known cliff. The agent drafts the structure of the memo. A senior fills in the numbers from a primary source.

The categories are simple on purpose. The point was not to build a perfect taxonomy. It was to give junior staff a phrase they could say out loud in a busy week, without feeling like they were second-guessing the partner who had forwarded the answer.
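For teams that want the same categories in a review tool, they can be written down as a small triage function. This is a sketch in Python under stated assumptions, not the firm's tooling: the topic labels, the field names, the 548-day approximation of 18 months, and the choice to let red override everything else are all ours for illustration.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Category(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


# Hypothetical topic labels; a real list would come from the firm's seniors.
RED_TOPICS = {"cross_border_vat", "recent_rate_change", "scheme_threshold_cliff"}
MAX_SOURCE_AGE_DAYS = 548  # roughly 18 months


@dataclass
class AgentOutput:
    topics: set[str]
    from_internal_kb: bool              # restates a rule from the firm's own knowledge base
    citation_is_primary: bool           # Belastingdienst page or published tax law article
    source_date: Optional[date] = None  # None when nothing checkable is cited


def triage(out: AgentOutput, today: date) -> Category:
    # Assumption: topic risk is checked first, so a good citation never rescues a red topic.
    if out.topics & RED_TOPICS:
        return Category.RED
    # No checkable source, or a non-primary one such as a blog post.
    if out.source_date is None or not out.citation_is_primary:
        return Category.YELLOW
    # Source older than about 18 months.
    if (today - out.source_date).days > MAX_SOURCE_AGE_DAYS:
        return Category.YELLOW
    # Green only when every condition holds; anything else stays yellow.
    return Category.GREEN if out.from_internal_kb else Category.YELLOW


today = date(2024, 3, 20)  # arbitrary example date
draft = AgentOutput({"cross_border_vat"}, True, True, date(2024, 1, 5))
print(triage(draft, today).value)
```

The function defaults to yellow whenever a condition is not clearly met, which mirrors the intent of the categories: a human check is the fallback, not the exception.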

Day 3: The edge-case lab

On the morning of day three we ran what we called the trap test. We had spent the previous week sitting with three of the firm's senior tax advisors, asking them where they had seen the agent slip. They gave us thirty edge-case prompts: triangular EU transactions, MOSS thresholds, reverse charge on intra-community services, the solar panel zero rate, hospitality tips, the new e-publishing rule, real estate tax point timing, and a handful of small-business scheme corner cases.

We ran all thirty against the firm's agent in front of the team. Twenty-one came back with at least one factual error. Nine of those were the kind of error that would have crossed a client's desk without a follow-up question, because the wording was confident and the structure was professional.

The worst single answer was on a place-of-supply question for an event held in Belgium with attendees billed from a Dutch entity. The agent confidently applied the general B2B rule and produced a clean memo. Admission to an event is actually taxed where the event physically takes place, which would have shifted the VAT treatment entirely. The agent's answer cited no source. Nobody on the team had asked it for one.

The exercise did two things at once. It calibrated the team on the areas of tax work where the agent was unreliable. It also gave every person in the room a personal moment of "I would have sent that". That moment matters more than any policy document.

"The answers that fooled us were not the ones that sounded uncertain. They were the ones that sounded the most like one of us."
Senior tax advisor, Day 3 debrief

Day 4: Redesigning the workflow

On the morning of day four we redrew the team's workflow on a whiteboard. The question we kept asking was not "should the agent do this" but "what is the worst outcome if the agent does this and is wrong, and does anyone catch it before it ships".

Three categories of work moved off the agent entirely:

Citing a rate or threshold to a client without a primary-source link from a Dutch government domain.
Answering a question that involved a tax rule change in the last 24 months.
Anything involving cross-border VAT, where the agent's training data conflated distinct EU regimes.

What stayed on the agent: drafting the structure of memos the team would fill in, summarising long client emails, restructuring the team's own text into client-readable language, and producing first-pass meeting notes from transcripts. None of those produce a factual claim the agent invented. They all reshape text the firm already trusted.

This is where the recent Anthropic writing on containing Claude across products matches our experience on the ground. Containment is not about whether the model is capable. It is about deciding, per surface, what the model is allowed to be the source of truth for. For an accounting firm, the answer for tax rates is "never, by policy". For draft structure, it is "always, by default".
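The per-surface decision can be expressed as a plain lookup table. The sketch below is illustrative Python: the task labels are hypothetical names for the categories listed above, and the fall-back-to-human default for unknown tasks is our assumption rather than something the firm wrote down.

```python
from enum import Enum


class Route(Enum):
    AGENT = "agent drafts"
    HUMAN = "human works from a primary source"


# Hypothetical task labels mapped to where the work goes.
POLICY: dict[str, Route] = {
    "memo_structure": Route.AGENT,
    "client_email_summary": Route.AGENT,
    "plain_language_rewrite": Route.AGENT,
    "meeting_notes_from_transcript": Route.AGENT,
    "client_rate_or_threshold": Route.HUMAN,
    "rule_changed_in_last_24_months": Route.HUMAN,
    "cross_border_vat": Route.HUMAN,
}


def route(task: str) -> Route:
    # Unknown task types go to the human, so the table fails closed.
    return POLICY.get(task, Route.HUMAN)


print(route("memo_structure").value)
print(route("something_new").value)
```

Keeping the policy as data rather than logic means the IT lead can review it line by line, and a new task type is blocked until someone decides where it belongs.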

Day 5: Rules of engagement

We closed the week with a one-page document. Not a policy, a card. The team pinned it next to their monitors. Three rules.

Numbers never autopilot. Any rate, threshold, deadline, or percentage from the agent is yellow until a human has checked it against a primary source. No exceptions for senior staff.
Confidence words trigger a check. If the agent says "always", "never", "in all cases", or "the rate is", read it as a flag, not a fact. Those phrases marked nine of the ten worst answers we found that week.
The agent cites or it doesn't ship. If the output going to a client contains a claim and not a link, it goes back in the queue. The link has to resolve. The team checks one in five at random.
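The three rules are mechanical enough to pre-screen in code before a human looks at a draft. A rough sketch in Python, assuming drafts are plain strings: the digit check is a deliberately crude stand-in for "contains a rate, threshold, deadline, or percentage", the function names are ours, and nothing here checks that a link actually resolves.

```python
import math
import random
import re

# The four phrases named on the card; matching is case-insensitive.
CONFIDENCE_PHRASES = ["always", "never", "in all cases", "the rate is"]
_PHRASE_RE = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in CONFIDENCE_PHRASES) + r")\b",
    re.IGNORECASE,
)
_NUMBER_RE = re.compile(r"\d")
_LINK_RE = re.compile(r"https?://\S+")


def flags(text: str) -> list[str]:
    """Return the reasons a draft needs a human check before it ships."""
    reasons = []
    if _NUMBER_RE.search(text):
        reasons.append("contains a number: check against a primary source")
    hits = sorted({m.group(1).lower() for m in _PHRASE_RE.finditer(text)})
    if hits:
        reasons.append("confidence words: " + ", ".join(hits))
    if not _LINK_RE.search(text):
        reasons.append("no link: back in the queue")
    return reasons


def pick_for_link_check(drafts: list[str], rng: random.Random) -> list[str]:
    """Pick one in five drafts (rounded up) for a manual check that the link resolves."""
    return rng.sample(drafts, k=math.ceil(len(drafts) / 5))


for reason in flags("The rate is 21% in all cases."):
    print(reason)
```

A screen like this only raises flags. It cannot clear a draft, and an empty list still leaves the named human check in place.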

The most expensive failure mode is not the agent being wrong. It is a junior assuming the partner has already checked, and the partner assuming the junior has. Make the checking step explicit, written down, and owned by name.

Six weeks later

We went back at the end of April. The rules card was still pinned up at most desks. The three categories had survived. The trap test was now part of the firm's monthly internal review: a rotating senior runs five fresh edge cases through the agent and shares the worst answer with the team. No shame, just calibration.

Two things did not stick. The confidence audit did not repeat as a formal weekly exercise. People did it in their heads instead, which is probably fine. And one of the partners had begun forwarding agent answers as a "starting draft" to clients with a caveat, which we had explicitly recommended against. Old habits.

The metric that mattered most to the managing partner was simpler than any of this. In the six weeks after the training week, no agent-sourced error came close to reaching a client. In the six weeks before, three had.

That metric understates the cost of the three. One of them would have created an over-claimed input VAT position that a Belastingdienst review would have caught later, with the firm carrying the explanation. The other two would have produced an invoice correction at minimum and a client conversation neither side wanted to have. None of them were catastrophic. All of them would have cost the firm professional time and a quiet drop in the client's confidence that the firm reads before sending.

What we would do differently

If we ran this again, we would shorten the week to four days and use the fifth to write the firm's own internal page of "agent-known-bad" areas. We tried to do that in the margins and it did not get finished. The page needs to be owned by a single senior with calendar time on the books, not built by committee.

The thing we kept hitting that week was that the agent's confident tone was the actual problem, more than its error rate: right and wrong answers looked identical. We ended up addressing it by giving the team a vocabulary they could use out loud, which is the kind of work that sits between an AI agent rollout and real organisational change.

The smallest thing you could do today: pick three answers your team has acted on from your agent in the last month, re-run them, and ask the room whether the confidence on the screen matched the confidence in the room. If even one answer fails that test, you have your week's curriculum already.

Key takeaway

Prompting is the easy half of AI literacy. The hard half is calibrating how much weight to put on any one answer, especially when it sounds the most like one of your own people.

FAQ

Do you have to remove the AI agent to run an AI-literacy week?

No. The point is not to remove the agent. It is to make the team accurate about where the agent is unreliable so they keep using it where it actually helps and stop using it where it costs them.

How long does an AI-literacy week need to be?

Five days worked for a 25-person firm. For a team under ten, three days is enough if you compress the confidence audit and the trap test into a single combined session and keep partners in the room for both.

Who should run the trap test?

A senior practitioner from inside the firm, not the AI vendor and not an outside trainer. The edge cases have to come from real recent client work or the team will not believe the results in the room.

What is the single biggest mistake firms make with internal agents?

Treating prompting as the skill that matters. The skill that matters is confidence calibration: knowing when an answer that sounds right is the kind your firm cannot afford to be wrong about.
