
Email agents in accounting: a 47-second Twinfield filer

A Haarlem accounting firm was babysitting a 9-stage Make.com scenario every morning. We rebuilt it as one email agent. Filing time per invoice dropped to 47 seconds.

Jacob Molkenboer · Founder · A Brand New Company · 1 Oct 2024 · 9 min


The first call came in at 07:42 on a Tuesday. The office manager at a 23-person accounting firm in Haarlem was staring at a red error on stage seven of a Make.com scenario she didn't build and didn't fully understand. The scenario was supposed to read receipts out of a shared inbox, OCR them, classify the supplier, look the supplier up in Twinfield, and file a purchase invoice entry against the right ledger. Instead it had stalled on a PDF from a Belgian fuel station that put the VAT line in a place the regex didn't expect. Nineteen invoices were queued behind it. Her bookkeeper was on the train.

This is the story of why we tore that scenario out and replaced it with a single email agent, and what we learned about the seams between an LLM, a 25-year-old accounting product, and a team that needs to close the month on time.

The old scenario, honestly described

The firm had inherited the Make.com flow from a freelancer who had moved on. It had nine stages. The exact list, from the canvas:

1. Gmail watcher on facturen@.
2. Filter on attachment type (PDF, JPG, HEIC).
3. Upload to Google Drive, dated folder.
4. OCR via a third-party module.
5. Regex extractor pulling supplier name, IBAN, invoice number, total, VAT.
6. Lookup against a Google Sheet that mapped suppliers to Twinfield ledger codes.
7. Branch: known supplier vs. new supplier.
8. Twinfield SOAP call to create a purchase transaction.
9. Reply to the original email with a confirmation, or escalate to the bookkeeper if any field was empty.

On a quiet day it worked. On a busy day the stage-5 regex broke on anything unusual: a receipt photographed on a phone, a Dutch invoice with a foreign VAT format, an attachment that was actually a payment reminder rather than an invoice. The escalation rule fired so often that the bookkeeper had built a habit of opening Make.com first thing in the morning and clearing the queue by hand. That habit was the real cost. It was 35 to 50 minutes a day, every day, for someone who bills out at €95 an hour.

What an agent changes about this problem

A regex either matches or it doesn't. An LLM can read a receipt the way a human reads one: it sees that the bold number at the top is the invoice number, that the line saying BTW 21% is VAT, that the small print under the logo is the supplier even if the trading name on the bank line is different. It can also notice when an attachment isn't an invoice at all, which the old scenario could not.

What an agent does not change is the boring part: you still need a deterministic way to write to Twinfield, you still need a place to put files that the firm's auditors can inspect later, and you still need an audit log a human can read in plain language. We kept those parts boring on purpose.

Takeaway

The win wasn't replacing a workflow tool with an LLM. It was replacing nine fragile glue stages with one reading step, and leaving the writing step deterministic.

The new shape

One inbox, one agent process, three tools the agent is allowed to call. The agent reads an incoming email, extracts the invoice, decides what to do, and either files it or asks the bookkeeper a specific question. No branches on a canvas, no regex.

The tool surface we exposed to the model is small. Small tool surfaces are how you keep an agent honest.

// tools.ts — the only things the agent can do
export const tools = [
  {
    name: "lookup_supplier",
    description: "Find a supplier in Twinfield by name, IBAN, or VAT number. Returns ledger code and default cost account, or null.",
    input_schema: {
      type: "object",
      properties: {
        name: { type: "string" },
        iban: { type: "string" },
        vat_number: { type: "string" },
      },
    },
  },
  {
    name: "file_purchase_invoice",
    description: "Create a purchase transaction in Twinfield. Idempotent on (supplier_code, invoice_number).",
    input_schema: {
      type: "object",
      required: ["supplier_code", "invoice_number", "invoice_date", "total_incl_vat", "vat_lines", "cost_account"],
      properties: {
        supplier_code: { type: "string" },
        invoice_number: { type: "string" },
        invoice_date: { type: "string", format: "date" },
        total_incl_vat: { type: "number" },
        vat_lines: {
          type: "array",
          items: {
            type: "object",
            properties: {
              rate: { type: "number" },
              base: { type: "number" },
              vat:  { type: "number" },
            },
          },
        },
        cost_account: { type: "string" },
        attachment_drive_id: { type: "string" },
      },
    },
  },
  {
    name: "ask_bookkeeper",
    description: "Reply to the original email with a specific question. Use only when a field cannot be determined and is required.",
    input_schema: {
      type: "object",
      required: ["question", "fields_in_doubt"],
      properties: {
        question: { type: "string" },
        fields_in_doubt: { type: "array", items: { type: "string" } },
      },
    },
  },
] as const;

Three tools. That's the whole API the model gets. Everything else (Drive upload, PDF parsing, Twinfield SOAP envelope, audit log) runs in code around the agent. The model decides; the code acts.
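
To make "the model decides; the code acts" concrete, here is a minimal sketch of a dispatcher that sits between the model's tool calls and the code. It is an illustration, not the code that runs at the firm: the names makeDispatcher, ToolCall and ToolResult are hypothetical, and the handlers are stubs standing in for the real Twinfield and mail code.

```typescript
// Hypothetical dispatcher sketch. Only the three tool names come from tools.ts;
// every type and function name here is an assumption made for illustration.
type ToolName = "lookup_supplier" | "file_purchase_invoice" | "ask_bookkeeper";

interface ToolCall {
  name: string; // whatever the model asked for, not yet trusted
  input: Record<string, unknown>;
}

type ToolResult = { ok: true; value: unknown } | { ok: false; error: string };

type Handler = (input: Record<string, unknown>) => ToolResult;

function makeDispatcher(handlers: Record<ToolName, Handler>) {
  return function dispatch(call: ToolCall): ToolResult {
    // Anything outside the three-tool surface is refused, never guessed at.
    if (!Object.prototype.hasOwnProperty.call(handlers, call.name)) {
      return { ok: false, error: `unknown tool: ${call.name}` };
    }
    return handlers[call.name as ToolName](call.input);
  };
}

// Stub handlers; real ones would call Twinfield and the mail server.
const dispatch = makeDispatcher({
  lookup_supplier: () => ({ ok: true, value: null }),
  file_purchase_invoice: () => ({ ok: true, value: { transaction_id: "stub-1" } }),
  ask_bookkeeper: () => ({ ok: true, value: "question sent" }),
});
```

The point of the shape is that an unexpected tool name produces an error result the loop can log, rather than an action.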

The Twinfield part nobody warns you about

Twinfield's public web services are SOAP. They have been SOAP for a long time. Authentication goes through OAuth 2.0, which is fine, but the actual transaction calls speak XML and they care about whitespace, namespace prefixes, and the exact shape of the <office> element. We did not let the LLM anywhere near this. The file_purchase_invoice tool is a typed function that builds the XML from validated fields. If the XML is wrong, it's wrong because we wrote it wrong, not because the model hallucinated a tag.
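
A sketch of what "builds the XML from validated fields" can look like follows. The element names purchaseInvoice, supplier, invoiceNumber, date and total are placeholders chosen for illustration and are not Twinfield's actual schema; only the office element is mentioned above. The function names and the InvoiceFields type are likewise hypothetical.

```typescript
// Placeholder XML builder. Element names other than <office> are NOT Twinfield's
// real schema; they only illustrate validate-then-escape-then-build.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;") // must run first so later entities are not double-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

interface InvoiceFields {
  office: string; // hypothetical: the office code the entry belongs to
  supplier_code: string;
  invoice_number: string;
  invoice_date: string; // expected as YYYY-MM-DD
  total_incl_vat: number;
}

function buildInvoiceXml(f: InvoiceFields): string {
  // Reject bad input before any XML exists, so a malformed document is never sent.
  if (!/^\d{4}-\d{2}-\d{2}$/.test(f.invoice_date)) {
    throw new Error(`invoice_date is not YYYY-MM-DD: ${f.invoice_date}`);
  }
  if (!Number.isFinite(f.total_incl_vat)) {
    throw new Error("total_incl_vat must be a finite number");
  }
  for (const [key, value] of Object.entries({
    office: f.office,
    supplier_code: f.supplier_code,
    invoice_number: f.invoice_number,
  })) {
    if (value.trim() === "") throw new Error(`${key} must not be empty`);
  }
  return [
    "<purchaseInvoice>",
    `  <office>${escapeXml(f.office)}</office>`,
    `  <supplier>${escapeXml(f.supplier_code)}</supplier>`,
    `  <invoiceNumber>${escapeXml(f.invoice_number)}</invoiceNumber>`,
    `  <date>${f.invoice_date}</date>`,
    `  <total>${f.total_incl_vat.toFixed(2)}</total>`,
    "</purchaseInvoice>",
  ].join("\n");
}
```

Whatever the model extracted passes through validation and escaping first, so a stray ampersand in an invoice number cannot change the shape of the document.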

Idempotency mattered more than we expected. The same invoice can land in the inbox twice: once forwarded by the supplier, once by the office manager who wasn't sure if the supplier had sent it. The tool keys on (supplier_code, invoice_number) and returns the existing transaction id on a duplicate rather than creating a second one. This single decision removed an entire class of "why is this invoice booked twice" questions from the bookkeeper's week.
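
The keying described above can be sketched in a few lines. This is an illustration under assumptions, not the production tool: makeIdempotentFiler is a hypothetical name, the store is an in-memory Map where a real one would need to be durable, and createTransaction stands in for the actual Twinfield call.

```typescript
// Idempotency sketch: the same (supplier_code, invoice_number) pair always maps
// to the same transaction id. All names here are hypothetical.
interface PurchaseInvoice {
  supplier_code: string;
  invoice_number: string;
  total_incl_vat: number;
}

interface FileResult {
  transaction_id: string;
  duplicate: boolean;
}

function makeIdempotentFiler(createTransaction: (inv: PurchaseInvoice) => string) {
  // In-memory for illustration only; it forgets everything on restart.
  const filed = new Map<string, string>();

  return function file(inv: PurchaseInvoice): FileResult {
    // JSON-encode the pair so ("AB", "C") and ("A", "BC") cannot collide.
    const key = JSON.stringify([inv.supplier_code, inv.invoice_number]);
    const existing = filed.get(key);
    if (existing !== undefined) {
      // Duplicate: return the earlier id and do not touch the ledger again.
      return { transaction_id: existing, duplicate: true };
    }
    const id = createTransaction(inv);
    filed.set(key, id);
    return { transaction_id: id, duplicate: false };
  };
}
```

With this shape, the supplier's copy and the office manager's forwarded copy both resolve to the same transaction id, and the ledger is written once.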

Warning

If you wire an agent into Twinfield without idempotency, you will book duplicate entries. The SOAP API will happily accept them. Your client will not happily un-book them.

What 47 seconds actually means

The number in the title is the median wall-clock time from email arrival to booked transaction, measured over four weeks of production traffic. It breaks down roughly like this: about 9 seconds for attachment fetch and PDF-to-image conversion, about 22 seconds for the model to read the invoice and call lookup_supplier (itself a round-trip to Twinfield), about 11 seconds for the Twinfield SOAP round-trip that books the transaction, and about 5 seconds for the confirmation reply and Drive archival. The model is not the slow part. Twinfield is the slow part, and it has been the slow part since 2003.

The number that actually matters to the firm is different. The bookkeeper used to spend 35 to 50 minutes a day clearing the queue. She now spends about 6 minutes a day answering specific ask_bookkeeper questions from her phone on the way to work. Most of those questions are about a missing cost account on a brand-new supplier, which is a decision only a human should make.

Where the agent still asks for help

We did not try to drive the escalation rate to zero. We tried to make every escalation worth a human's attention. Three categories survived:

New suppliers. The first invoice from a supplier the firm has never seen needs a ledger code and a cost account chosen by a person. The agent extracts everything else and presents a one-click reply.
Ambiguous documents. Credit notes, pro-forma invoices, and statements-of-account look like invoices but aren't. The agent flags them with the specific reason.
VAT that doesn't reconcile. When the sum of VAT lines doesn't match the printed total within €0.02, the agent refuses to file and asks. This is almost always a scanning artefact, but "almost always" is not a good enough reason to silently book the wrong amount.
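
The reconciliation rule in the last category is simple enough to sketch. One assumption to flag: this version compares the sum of base plus VAT over all lines against the printed total including VAT, which is one reasonable reading of the rule. The names vatReconciles and toCents are hypothetical, and amounts are converted to integer cents so the €0.02 tolerance is not blurred by floating-point error.

```typescript
// VAT reconciliation sketch. Assumes "reconciles" means:
// sum(base + vat) over all lines is within 2 cents of the printed total.
interface VatLine {
  rate: number; // percentage, e.g. 21
  base: number; // euros, excluding VAT
  vat: number; // euros of VAT on this line
}

// Round to whole cents once, then do all arithmetic on integers.
const toCents = (euros: number): number => Math.round(euros * 100);

function vatReconciles(
  lines: VatLine[],
  totalInclVat: number,
  toleranceCents = 2,
): boolean {
  const sumCents = lines.reduce(
    (acc, line) => acc + toCents(line.base) + toCents(line.vat),
    0,
  );
  return Math.abs(sumCents - toCents(totalInclVat)) <= toleranceCents;
}
```

When this returns false, the agent's only permitted move is ask_bookkeeper; filing is refused.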

What we threw away from the Make.com version

The Google Sheet of supplier-to-ledger mappings is gone. Twinfield already knows this. We were maintaining a parallel copy because the old scenario found it easier to read from Sheets than to call the SOAP API. The agent calls the SOAP API. One source of truth.

The regex stage is gone, obviously. So is the branch for "known vs. new supplier" (the agent handles both paths in the same prompt). So is the dated-folder Drive structure, which we replaced with a flat archive keyed by Twinfield transaction id. Auditors find things faster when the filename matches the entry they're looking at.

The runtime, briefly

The agent runs as a small Node process behind the firm's existing mail server. It does not run in a workflow tool. There is a single loop: poll inbox, dispatch one message at a time to the agent, write the result to an append-only audit log on disk. The audit log is plain JSONL and the bookkeeper can grep it. This sounds obvious until you've tried to explain to a non-engineer why a particular invoice was filed against the wrong account in a no-code tool's run history.
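
The loop described above, reduced to a sketch. It is synchronous and uses injected callbacks so it stays self-contained; the real process would be asynchronous and would append to a file on disk. The names runOnce, toJsonl, InboxMessage and AuditEntry, and the fields of the audit entry, are assumptions for illustration rather than the firm's actual format.

```typescript
// One pass of the loop: handle messages one at a time, write one JSONL line each.
// All names and the audit entry shape are hypothetical.
interface InboxMessage {
  id: string;
  subject: string;
}

interface AuditEntry {
  at: string; // ISO timestamp
  message_id: string;
  outcome: "filed" | "asked" | "error";
  detail: string; // plain-language explanation a bookkeeper can read
}

type Outcome = Pick<AuditEntry, "outcome" | "detail">;

// JSONL: exactly one JSON object per line, newline-terminated.
function toJsonl(entry: AuditEntry): string {
  return JSON.stringify(entry) + "\n";
}

function runOnce(
  messages: InboxMessage[],
  handle: (m: InboxMessage) => Outcome, // stands in for the agent
  append: (line: string) => void, // stands in for an append-only file write
  now: () => Date = () => new Date(),
): void {
  for (const m of messages) {
    let result: Outcome;
    try {
      result = handle(m);
    } catch (err) {
      // A failure is logged like any other outcome; it never stops the queue.
      result = {
        outcome: "error",
        detail: err instanceof Error ? err.message : String(err),
      };
    }
    append(toJsonl({ at: now().toISOString(), message_id: m.id, ...result }));
  }
}
```

Because every message produces exactly one line, a failed invoice shows up in the log as an entry rather than as a stalled queue.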

Anthropic's writeup on building effective agents captures the shift well: the interesting work in agent systems is no longer the model call, it's the small deterministic loop around it. Our loop is 180 lines of TypeScript. It has been edited twice since we shipped it.

What this cost, and what it costs to run

Build was four weeks, two of which were spent reverse-engineering the Twinfield SOAP edge cases and writing a test rig that replays real invoices against a sandbox office. Model spend runs about €0.04 per filed invoice at current rates. The firm processes around 1,400 invoices a month, so the model bill is about €56. The freed bookkeeper hours are worth roughly €3,200 a month at her billing rate. The math is not subtle.

Closing

When we built the email agent for this firm, the thing we kept running into was Twinfield's zero tolerance for malformed SOAP and its easy tolerance for duplicates. We solved it by keeping the writing step deterministic and idempotent, and letting the model only do the reading. If you have a similar shape of problem (an inbox, a legacy system of record, a person clearing a queue every morning), that's the pattern we'd start from. We build this kind of email agent for European SMEs.

The smallest thing you can do today: open your own Make.com or Zapier dashboard, sort scenarios by error count over the last 30 days, and write down the name of the person who clears those errors by hand. That person is the project.

Key takeaway

Replace the reading layer of your finance automation with an agent, keep the writing layer deterministic and idempotent, and the queue stops needing a human babysitter.

FAQ

Why not keep Make.com and just add an LLM step?

We tried. The handoff between Make.com modules and an LLM step still left the regex extractor and the branch logic in place. Replacing the whole reading layer with one agent was simpler than patching nine stages.

Does the agent ever book the wrong invoice?

Not yet, over four weeks and roughly 1,400 invoices. The idempotency key and the VAT reconciliation check catch the cases where it would have. When in doubt, the agent escalates rather than guesses.

Is Twinfield's SOAP API hard to work with?

It's old, not hard. OAuth 2.0 for auth, XML for transactions, strict about element order. Build a typed wrapper once, test it against a sandbox office, and you won't touch it again for months.

Could a non-technical bookkeeper maintain this?

They maintain the supplier defaults in Twinfield directly, which is where they were going to maintain them anyway. The agent code itself is ours to maintain. The audit log is plain JSONL they can grep.

What happens when Twinfield is down?

The agent finishes the read step, queues the file_purchase_invoice call, and retries with backoff. The confirmation email goes out only after the SOAP call succeeds, so the bookkeeper never sees a false confirmation.
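
A sketch of that ordering: retry the filing call with exponential backoff, and send the confirmation only after it succeeds. The delay values, the cap, and the names backoffDelaysMs and fileThenConfirm are assumptions for illustration, not the production settings.

```typescript
// Retry-with-backoff sketch. Delays double each attempt up to a cap.
// The numbers and names are illustrative assumptions.
function backoffDelaysMs(attempts: number, baseMs = 1000, capMs = 60000): number[] {
  return Array.from({ length: attempts }, (_, i) => Math.min(capMs, baseMs * 2 ** i));
}

async function fileThenConfirm<T>(
  fileInvoice: () => Promise<T>, // stands in for the Twinfield SOAP call
  sendConfirmation: (result: T) => Promise<void>, // stands in for the reply email
  delaysMs: number[],
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<boolean> {
  // One initial attempt plus one retry per configured delay.
  for (let attempt = 0; attempt <= delaysMs.length; attempt++) {
    let result: T;
    try {
      result = await fileInvoice();
    } catch {
      if (attempt === delaysMs.length) break; // out of retries, give up
      await sleep(delaysMs[attempt] ?? 0);
      continue;
    }
    // Reached only after a successful filing, so no false confirmation is sent.
    await sendConfirmation(result);
    return true;
  }
  return false;
}
```

Retrying the filing call is only safe because the call is idempotent: if Twinfield booked the entry but the response was lost, the retry returns the existing transaction instead of booking a second one.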

email automation · ai agents · case study · workflow · integrations · operations
