Email automation
Email agent for invoice chasing: a Haarlem AR rebuild
A 19-person Haarlem accountancy practice had four people on invoice chasing. Here is how we rebuilt that ritual around an email agent that drafts but never sends on its own.

The Tuesday morning ritual
It is 11:14 on a Tuesday in Haarlem. Sanne, who runs accounts receivable at a 19-person tax-and-bookkeeping practice on the Spaarne, has 287 open invoices on her screen. Three of them are sixty-plus days overdue. One belongs to a Berlin client who only replies in German. One is a logistics firm in Antwerp who only reads English. The rest are Dutch SMEs who treat reminders as a polite suggestion.
Sanne has done this every Tuesday for six years. So have three of her colleagues. Together they spend roughly 22 hours a week writing, sending, and tracking invoice-chase emails. None of them dislike the work exactly. They dislike the part where they are reading their own template back to themselves for the four-hundredth time and writing "I hope this email finds you well" in a language they did not study in school.
This is the story of how we replaced that ritual. Not with a bot that hits send on its own. With an email agent that drafts, classifies, and lines up the next move, then hands the human the easy part: read, decide, click.
The work behind the chase
Before we wrote a line of code, we sat in the AR office for two days. We do this on every email-automation project. The slide deck never matches the inbox.
What the four-person team actually did broke down roughly like this. About 40% of the week was the obvious bit: pulling the open-invoice list, picking who to chase, writing the email. About 25% was translation work. Roughly 15% of clients are German-speaking and 10% English-speaking; Sanne and one colleague were the only Dutch-and-German speakers, so emails to German clients piled up on their two queues. About 20% was triage: a reply comes in, is it a payment confirmation, a dispute, a bankruptcy notice, a request to split into three installments? Each one routes somewhere different. The last 15% was the part nobody put on the job description: chasing the chase. Logging that an email was sent. Updating the CRM. Booking a follow-up reminder for next week. Pulling the right invoice PDF out of the accounting system to attach.
The lever is not in the writing. It is in the translation, the triage, and the chasing-the-chase. That is what we built the agent around.
Shape of the agent
The system has four parts.
A reader. It connects to the firm's Exact Online accounting system over its REST API and pulls the open-invoice list every morning at 06:00 Amsterdam time. For each invoice past its due date by a configurable threshold (the firm landed on 7 days, 21 days, 45 days as their three buckets), it fetches the client record, the original invoice PDF, the prior reminder history, and any inbound replies on that thread.
A writer. For each invoice that needs a touch, it drafts an email in the client's preferred language. The language hint comes from the client record in Exact, with a fallback to a small classifier on prior correspondence. The tone matches the bucket: warm at 7 days, firm at 21, transactional at 45. Every draft cites the invoice number, the due date, the amount, and an IBAN. The Berlin client gets German that does not read like German written by a Dutch person.
A reviewer. This is the part the firm did not want to give up. Every draft lands in a queue inside the firm's existing Microsoft 365 mailbox, tagged with the invoice ID. Sanne or one of her colleagues opens the draft, reads it, edits if needed (rare, after week two), and hits send. We measured the time from draft-ready to human-clicks-send. The median settled at 38 seconds.
A bookkeeper. After send, the agent logs the send event back to Exact, schedules the next follow-up touch based on the new bucket, and waits for the reply. When a reply arrives, the agent classifies it (payment confirmation, dispute, installment request, out-of-office, hostile, bankruptcy) and routes it to the right human or the right next draft.
Drafting in three languages without sounding like a translator
This is where most invoice-chase automations fall apart. Pick any cloud accounting platform and look at its built-in reminder templates. They read like a stamp.
Our writer does not run on templates. It runs on a small prompt that knows three things: the client (industry, history, the gentle observation that this one paid late twice last year and might appreciate a less apologetic tone), the invoice context (amount, days overdue, whether anything is disputed in the thread history), and the house voice of the firm.
The house voice was the hard part. We sat down with the senior partner for an afternoon and reverse-engineered it from eighteen months of his own outbound emails. He has a specific way of opening a German email that does not sound like a textbook. We codified that. Same for the Dutch and English variants.
voice:
nl:
opening_warm: "Beste {first_name},"
opening_firm: "Geachte heer/mevrouw {last_name},"
sign_off: "Met vriendelijke groet,"
do_not_use:
- "wij hopen dat deze e-mail u in goede gezondheid bereikt"
de:
opening_warm: "Liebe Frau {last_name},"
opening_firm: "Sehr geehrte Damen und Herren,"
sign_off: "Mit freundlichen Grüßen,"
en:
opening_warm: "Hi {first_name},"
opening_firm: "Dear {last_name},"
sign_off: "Kind regards,"
do_not_use:
- "I hope this email finds you well"
That snippet is a slice. The real config is about 140 lines and includes idioms to avoid, escalation phrases per bucket, and the firm-specific quirks the senior partner could only name when we showed him bad drafts and asked what was wrong with them.
The human nod
The firm was clear from week one: nothing leaves the building without a person reading it.
We agreed. Not just for the obvious legal reason, though that matters. For two practical ones.
First, the failure modes of a fully-automated invoice chaser are catastrophic and quiet. Send a firm reminder to a client whose mother died yesterday and you do not lose the invoice. You lose the relationship. The 38 seconds a human spends reading the draft is the cheapest insurance policy in the system.
Second, the drafts get better when the humans edit them. Every edit is captured. Every edit becomes a training signal for the next version of the prompt and the few-shot examples. After six weeks the edit rate on Dutch drafts had fallen from 31% to 4%. On German drafts, from 19% to 7%. On English, from 22% to 6%.
Auto-send is the part of email automation you can almost always delete. The cost of keeping a human in the loop is measured in seconds. The cost of taking them out is measured in client relationships.
The classifier that decides what not to send
The most useful part of the system, by Sanne's own account, is not the writer. It is the classifier on the reply side.
When a client replies to a chase, six things tend to happen. Payment confirmations (47% of replies). Requests for a copy of the invoice (18%). Out-of-office bounces (12%). Installment-plan requests (11%). Disputes (9%). Angry, bankrupt, or otherwise weird (3%). Before the agent, every one of these landed in the same shared inbox and a human had to read it, decide, and route. Now, only categories four through six reach a human cold. The other categories get routed: payment confirmations close the loop automatically and reconcile against Exact, invoice-copy requests trigger a re-attach draft (the human still nods), out-of-office replies push the next follow-up touch to the date the OOO ends.
The classifier is a small model running locally on the firm's own machine, not a frontier API. We tested both. The frontier model was 2% more accurate and roughly 60 times more expensive at the firm's volume. The local one is good enough and keeps invoice content inside the office, which makes the firm's compliance officer happy and lines up cleanly with GDPR Article 28 on processor relationships. The recent wave of laptop-sized open models, including the Gemma 4 QAT release that drew a long thread on Hacker News this week, keeps closing the gap between "runs in the cupboard" and "good enough for production classification." That trend is what made the local-only architecture viable here.
Numbers after six weeks
Six weeks in, here is what changed.
- Hours spent on AR per week: down from about 22 to about 6. Most of the remaining six are the parts that should stay human (disputes, installment negotiations, the awkward phone calls).
- Days Sales Outstanding: down from 47 to 38. The nine-day improvement is partly the agent (consistent touches at the right intervals) and partly a side effect of better data hygiene that fell out of the integration work.
- Reply rate on first-touch reminders: up from 31% to 44%. We attribute most of that lift to language match. German clients reply more often when chased in German.
- Edit rate on agent drafts: 4% NL, 7% DE, 6% EN. Down from 31%, 19%, and 22% in week one.
- Mistakes the agent made that a human caught: 11. Mistakes the agent made that a human did not catch: 0, so far.
The four-person team is still four people. None of them got fired. Two of them moved to advisory work on the senior partner's book, which is what they wanted. One picked up the firm's expansion into UK clients. One is happier doing the six hours of AR work that is actually interesting.
What we got wrong on the way
Three things, recorded so we do not do them again.
We over-engineered the bucket logic. Our first version had five overdue buckets with separate tone profiles. The firm pushed back: three buckets, three tones, done. We were inventing complexity the operations team did not want. When the client says simpler, the client is right.
We initially routed every reply through a frontier model. Cost spiked 40x in week two when one client got into a 17-email back-and-forth dispute and each new message re-classified the whole thread. We switched the classifier to a local model and capped the context window. Cost came back under control. Accuracy held.
We forgot about VAT. The first version of the writer cheerfully drafted reminders for invoices where the dispute was about VAT treatment, not non-payment. The senior partner caught it on day three. We added a "dispute detected in thread" flag that flips the draft into a "we should talk before we chase again" template that routes to a human, full stop.
If your invoice-chase agent reads the thread history, it must also read the thread history for disputes. The most expensive mistake an AR agent can make is chasing a client about an invoice they have already told you is wrong.
The five-minute audit
If you run an AR function and the words above sound familiar, here is a way to find out whether there is an agent-shaped hole in your week.
Open your sent folder. Filter for the last thirty days. Search for "vriendelijke herinnering," "kind reminder," "freundliche Erinnerung." Count the results. Multiply by four minutes (the average time per touch in our measurements). That number is your weekly hours on invoice chasing.
If it is over 8, an email agent will pay for itself in the first quarter. If it is over 15, it will pay for itself in week three. If it is under 4, you do not have an automation problem, you have a collections problem, and the answer is a phone call.
When we built this for the Haarlem practice, the thing we kept running into was the gap between "the agent can technically send" and "the firm trusts it to send." We solved it by removing auto-send entirely and treating the 38-second human nod as a feature, not a friction. Most of the AI agents we ship for operations teams end up shaped the same way: the cheapest insurance is the human click.
Key takeaway
Auto-send is the part of email automation you can almost always delete. In an AR agent, the cheapest insurance is the 38-second human click.
FAQ
Why not let the agent send invoice reminders automatically?
The failure modes are quiet and expensive. A wrongly-toned reminder to a grieving client costs you a relationship, not an invoice. The 38-second human review is the cheapest insurance in the system.
Does an AI really write three languages without sounding translated?
Yes, once you reverse-engineer the firm's house voice in each. We spent an afternoon with the senior partner codifying his actual openings, sign-offs, and forbidden idioms in Dutch, English, and German.
What about GDPR if the agent reads invoice content?
We run the reply classifier on a small local model, not a frontier API. Invoice content stays inside the firm's office. The compliance officer signed off on the architecture against GDPR Article 28.
How long until the agent's drafts stopped needing heavy edits?
About six weeks. The edit rate on Dutch drafts fell from 31% to 4%, on German from 19% to 7%, on English from 22% to 6%. Each human edit feeds back into the prompt and few-shot examples.