Email automation

Invoice-chase agent playbook: 1,420 threads, three systems

An Antwerp customs broker had 1,420 overdue invoice threads to chase every week, a Cargonaut feed from 2007, and an Exact Online ledger. Here is the playbook we shipped.

Jacob Molkenboer · Founder · A Brand New Company · 12 Jun 2026 · 9 min


Tuesday morning at the broker's office off the Antwerpse Ring. The finance lead, Inge, opens her week with the same number she has seen every Tuesday for two years: 1,420. That is the count of overdue invoice threads the team has to chase across three systems before Friday. An average DSO (days sales outstanding) of forty-seven days. A spreadsheet with seven tabs. Two part-time assistants who do almost nothing except type "vriendelijke herinnering" ("friendly reminder") into Outlook all day.

The company is a 23-person customs and logistics broker. They book containers, file declarations through Cargonaut (the Dutch customs message broker, in production since 2007), and bill clients through Exact Online. Customer service runs on Freshdesk. The same client can have ten paid invoices, three overdue, and one open ticket about a missing CMR (road consignment note). The assistants do not always know which is which. When they get it wrong, a customer who is already arguing about a €4,200 demurrage line gets a third reminder for it that afternoon. The phone rings. The relationship ages a year in ten minutes.

We built them an invoice-chase agent. This is the playbook.

The three systems that own the truth

Before any agent gets written, you write down which system owns which fact. Get this wrong and the agent invents reality out of stale data.

For this broker:

Exact Online owns the invoice ledger, payment status, and remittance matching. If Exact says it is paid, it is paid.
Cargonaut owns the customs declaration status and the CMR / B/L references that link a shipment to an invoice line.
Freshdesk owns dispute state. A ticket tagged dispute, klacht (complaint), or betwist (disputed) against an invoice number is the single source of truth for "do not chase this".

The agent never guesses. If it cannot match an invoice line to a Cargonaut reference, the thread goes to a human queue. If it cannot determine whether Freshdesk has an open dispute, it does not send. Refusing to act is a first-class outcome, not a fallback.
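One way to make refusal a first-class outcome is to have the decision return an explicit result instead of a boolean, so a refusal gets recorded rather than silently swallowed. A minimal sketch under that assumption; the `Outcome` names and the `decide` signature are hypothetical, not the broker's code, and attempt numbers are left out.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    """Every chase decision ends in exactly one of these; refusing is not an error."""
    SEND = "send"
    HUMAN_QUEUE = "human_queue"  # invoice line could not be linked to a shipment
    HOLD = "hold"                # dispute state is open or unknown: do not send


def decide(shipment_linked: bool, dispute_open: Optional[bool]) -> Outcome:
    # No Cargonaut match: a person has to look at it.
    if not shipment_linked:
        return Outcome.HUMAN_QUEUE
    # None means the dispute lookup failed or timed out; unknown is treated like open.
    if dispute_open is None or dispute_open:
        return Outcome.HOLD
    return Outcome.SEND
```

Because `HOLD` and `HUMAN_QUEUE` are ordinary return values, they can be logged and counted like sends.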

Takeaway

An invoice-chase agent that refuses to act when uncertain is worth more than one that acts confidently and wrong. The wrong reminder costs you the client, not the invoice.

Reading a 19-year-old Cargonaut feed without rewriting it

Cargonaut still ships EDIFACT messages over an SFTP drop. The structure has not meaningfully changed since the mid-2000s. The broker had a custom PHP parser, written in 2011, that nobody on the current team had touched. It worked. Replacing it was not on the table, and we did not want it to be.

We left it alone. The agent reads what the parser already writes into the broker's MySQL shipments table. The connection between an Exact invoice line and a Cargonaut declaration is the b_l_ref column. About 9% of invoice lines did not have a clean match, usually because the reference was typed by hand into the Exact line description. We handle those with a small classifier that looks for B/L-shaped strings in the description text ([A-Z]{4}\d{7,10} covers most carriers) and only matches if the carrier code lines up with the shipment's known carrier.

The matching step, simplified:

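import re

# The first four letters are treated as the carrier code, followed by 7-10 digits.
# Word boundaries stop the pattern from matching inside a longer token.
BL_PATTERN = re.compile(r"\b[A-Z]{4}\d{7,10}\b")

def link_invoice_line(line, shipments):
    """Return (shipment, confidence) for one Exact invoice line.

    `shipments` maps B/L reference -> shipment row from the MySQL shipments
    table that the existing Cargonaut parser fills.
    """
    # Clean match on the b_l_ref column.
    if line.bl_ref and line.bl_ref in shipments:
        return shipments[line.bl_ref], "exact"

    # Fallback: B/L-shaped strings typed by hand into the description,
    # accepted only when the carrier code agrees with the shipment's carrier.
    for ref in BL_PATTERN.findall(line.description or ""):
        ship = shipments.get(ref)
        if ship and ship.carrier_code == ref[:4]:
            return ship, "inferred"

    return None, "unmatched"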

Inferred matches go through the agent but carry a confidence tag in the audit log. Unmatched lines are never chased automatically; they land in a Monday review queue for Inge. She works through it with coffee.
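The routing after matching can be sketched as a small function over the `(shipment, confidence)` pair that `link_invoice_line` returns. The `Routing` container and its field names are hypothetical; in production the agent side would write the confidence tag to the audit log rather than append to a list.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple


@dataclass
class Routing:
    agent: List[Tuple[str, str]] = field(default_factory=list)  # (invoice_no, confidence)
    review_queue: List[str] = field(default_factory=list)      # never chased automatically


def route(invoice_no: str, match: Tuple[Optional[Any], str], routing: Routing) -> None:
    shipment, confidence = match
    if shipment is None:
        # "unmatched": a person reviews it on Monday.
        routing.review_queue.append(invoice_no)
    else:
        # "exact" or "inferred": chased, with the confidence kept for the audit trail.
        routing.agent.append((invoice_no, confidence))
```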

The Freshdesk gate

This was the rule the client cared about more than any other: never send a second reminder to a customer whose dispute ticket is still open. A first nudge is fine. A second one, while the customer service team is actively negotiating, is what shreds a relationship.

The gate is one call to the Freshdesk API per candidate thread, filtered by the customer's company ID and by the tags dispute, klacht, or betwist. We cache the result for fifteen minutes: Inge's assistant does not retag tickets in real time, and at this volume an extra Freshdesk call per reminder is not free. The cache key is the customer ID plus the invoice number. If the agent has not checked that combination in the last fifteen minutes, it asks Freshdesk; otherwise it uses the cached state.

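def can_chase(customer_id, invoice_no, attempt, dispute_cache, freshdesk, log):
    """Return True if this reminder may be sent."""
    if attempt == 1:
        return True  # First nudge is always allowed.

    # Dispute state is cached per (customer, invoice) for 15 minutes (900 s).
    key = (customer_id, invoice_no)
    state = dispute_cache.get(key)
    if state is None:
        # Cache miss: ask Freshdesk. If this call raises, the exception
        # propagates to the caller instead of defaulting to "send".
        state = freshdesk.has_open_dispute(customer_id, invoice_no)
        dispute_cache.set(key, state, ttl=900)

    if state.has_open_ticket:
        # Record which ticket blocked the send.
        log.audit("blocked", customer_id, invoice_no, state.ticket_id)
        return False
    return True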
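The `dispute_cache` that `can_chase` uses only needs `get(key)` and `set(key, value, ttl=...)`. A minimal in-process sketch, assuming a single worker process (several workers would need a shared store); the `TTLCache` class and its injectable clock are hypothetical, the clock added so expiry can be tested without waiting.

```python
import time
from typing import Any, Callable, Dict, Hashable, Optional, Tuple


class TTLCache:
    """Tiny in-process cache whose entries expire `ttl` seconds after being set."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Stale: drop it so the caller asks Freshdesk again.
            del self._data[key]
            return None
        return value

    def set(self, key: Hashable, value: Any, ttl: float) -> None:
        self._data[key] = (self._clock() + ttl, value)
```

Returning `None` for both "missing" and "expired" matches the `if state is None` check in `can_chase`.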

The audit log matters more than the code. Every blocked send is recorded with the ticket ID that blocked it. When a controller asks "why did we not chase Hoyer last week", the answer is one query.
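To show what "one query" can mean, here is a sketch using SQLite from the Python standard library. The `audit_log` table layout is hypothetical and the rows are made up; timestamps are stored as ISO 8601 strings so that string order matches time order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE audit_log (
           logged_at   TEXT NOT NULL,  -- ISO 8601 timestamp
           event       TEXT NOT NULL,  -- e.g. 'sent', 'blocked'
           customer_id TEXT NOT NULL,
           invoice_no  TEXT NOT NULL,
           ticket_id   TEXT            -- set when a dispute ticket blocked the send
       )"""
)
conn.executemany(
    "INSERT INTO audit_log VALUES (?, ?, ?, ?, ?)",
    [  # made-up sample rows
        ("2026-05-04T09:12:00", "blocked", "hoyer", "INV-1001", "FD-5512"),
        ("2026-05-06T09:03:00", "sent", "hoyer", "INV-1002", None),
        ("2026-04-20T09:00:00", "blocked", "hoyer", "INV-0990", "FD-5401"),
    ],
)

# "Why did we not chase Hoyer last week?" (week of Monday 4 May 2026)
rows = conn.execute(
    """SELECT logged_at, invoice_no, ticket_id
         FROM audit_log
        WHERE event = 'blocked'
          AND customer_id = ?
          AND logged_at >= ? AND logged_at < ?
        ORDER BY logged_at""",
    ("hoyer", "2026-05-04", "2026-05-11"),
).fetchall()
```

Each returned row names the invoice and the Freshdesk ticket that stopped the reminder.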

A short state machine, not a long prompt

The agent does not freestyle the chase. Each invoice has a state, and the agent moves it through a small machine:

due → reminder_1 → reminder_2 → final_notice → handover

The transitions are calendar-based, not model-based. Reminder 1 fires at +3 business days past due. Reminder 2 fires at +10. Final notice at +21. Handover to the human collections queue at +35. Letting the model decide when to chase is where these projects go off the rails. Calendar logic in Python. Language in the model. Not the other way around.
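A minimal sketch of the calendar side, assuming the state is derived from the due date and today's date on every run. It counts weekends only (public holidays would need a holiday calendar) and ignores paused invoices; `SCHEDULE`, `business_days_between`, and `chase_state` are illustrative names.

```python
from datetime import date, timedelta

# Business days past due -> state, checked from the latest threshold down.
SCHEDULE = [
    (35, "handover"),
    (21, "final_notice"),
    (10, "reminder_2"),
    (3, "reminder_1"),
]


def business_days_between(start: date, end: date) -> int:
    """Count Monday-Friday days in (start, end]; no public holidays."""
    days = 0
    current = start
    while current < end:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0 = Monday ... 4 = Friday
            days += 1
    return days


def chase_state(due: date, today: date) -> str:
    elapsed = business_days_between(due, today)
    for threshold, state in SCHEDULE:
        if elapsed >= threshold:
            return state
    return "due"
```

No model call is involved: given the same dates, the function always returns the same state, which is what makes each send auditable.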

The language model only gets called at the body-of-the-email stage, and only with a tightly scoped prompt: the customer's name, the invoice number, the language preference, the shipment reference, and the attempt number. It returns Dutch, French, or English copy in the broker's house tone. It is not allowed to invent payment terms, propose discounts, or commit to anything. The system prompt forbids any sentence that begins with "I will" or "we can offer", and a regex checks the output for those openers before anything leaves the building.
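The output check can be a single compiled pattern. The article names only the two English openers, so this sketch covers just those; the pattern, its definition of where a sentence begins, and `passes_gate` are assumptions, and Dutch and French copy would need their own equivalents.

```python
import re

# Hypothetical pattern: a sentence "begins" at the start of a line (after any
# whitespace) or after ., ! or ? followed by whitespace. Case-insensitive.
FORBIDDEN_OPENERS = re.compile(
    r"(?:^\s*|[.!?]\s+)(?:I will|we can offer)\b",
    re.IGNORECASE | re.MULTILINE,
)


def passes_gate(body: str) -> bool:
    """True if no sentence in the generated body opens with a forbidden phrase."""
    return FORBIDDEN_OPENERS.search(body) is None
```

A body that fails the gate is not sent; it can be regenerated or routed to a person.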

Handling replies without making a mess

Most write-ups of invoice agents stop at sending. The hard part is what happens when the customer replies. We saw three reply shapes that covered roughly 85% of inbound: "we already paid this on date X", "we dispute this for reason Y", and "please update the PO number on the invoice".

The reply handler is a separate worker that watches the chase mailbox and classifies into one of four lanes:

Paid claim. Open a reconciliation task in Exact, attach the original email, pause the chase for this invoice for 5 business days.
Dispute. Open a Freshdesk ticket tagged dispute and linked to the invoice number, pause the chase indefinitely, and notify Inge.
Admin correction. Route to a human; do not act.
Out-of-office or noise. Discard, and do not bump the attempt counter.

The classifier is a small model. Until a human has approved the first hundred replies of a given type, the lane it picks becomes a Freshdesk side-tag, not an action. After that, the paid-claim, dispute, and noise lanes run autonomously. Admin corrections always need a person. Inge approved this autonomy boundary in week four, on the back of a clean audit log.
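The autonomy boundary is simple enough to keep in code rather than in the classifier. A sketch, assuming approvals are counted per lane somewhere durable; the `Lane` names, the "suggest" and "auto" modes, and `handling_mode` are hypothetical.

```python
from collections import Counter
from enum import Enum


class Lane(Enum):
    PAID_CLAIM = "paid_claim"
    DISPUTE = "dispute"
    ADMIN_CORRECTION = "admin_correction"
    NOISE = "noise"


APPROVALS_BEFORE_AUTONOMY = 100
ALWAYS_HUMAN = {Lane.ADMIN_CORRECTION}


def handling_mode(lane: Lane, approved: Counter) -> str:
    """Return 'human', 'suggest' (side-tag only), or 'auto' for a classified reply."""
    if lane in ALWAYS_HUMAN:
        return "human"
    if approved[lane] < APPROVALS_BEFORE_AUTONOMY:
        # Below the threshold the lane is only a suggestion for a person to approve.
        return "suggest"
    return "auto"
```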

Writing back to Exact Online without trashing the audit trail

Exact Online has a perfectly serviceable REST API. The two things to know are the OAuth refresh dance (access tokens expire after 10 minutes, and refresh tokens rotate) and the per-minute, per-division rate limit. The limits are tighter in practice than the public documentation suggests. We bucket our writes at 60 per minute per division and have not been throttled in production once.
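Bucketing writes at 60 per minute per division can be sketched as a sliding-window limiter keyed by division number. The class, the sliding-window choice, and the injectable clock are assumptions for illustration; the sketch does not call the Exact Online API, and the caller decides whether to wait or requeue when `try_acquire` returns False.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class PerDivisionLimiter:
    """Sliding window: at most `limit` writes per `window` seconds per division."""

    def __init__(self, limit: int = 60, window: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window = window
        self.clock = clock
        self._sent: Dict[int, Deque[float]] = defaultdict(deque)

    def try_acquire(self, division: int) -> bool:
        """Record a write and return True if it fits in the window, else False."""
        now = self.clock()
        sent = self._sent[division]
        # Forget writes that have left the window.
        while sent and now - sent[0] >= self.window:
            sent.popleft()
        if len(sent) >= self.limit:
            return False
        sent.append(now)
        return True
```

Keeping one window per division means a burst in one administration never delays writes to another.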

The agent never edits an invoice. It only adds a notes entry and updates a custom field, abn_last_reminder_at. Every reminder it sends is also logged as a sent-email entry attached to the customer in Exact, so the controllers see the chase history exactly where they expect to. The original spreadsheet still exists. Inge opens it on Mondays out of habit. It is now generated from the Exact custom field, not the other way around.
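"Never edits an invoice" is easiest to guarantee with an allow-list in front of every write. A sketch, assuming each write is dispatched through a named operation; the operation names, `WriteRefused`, and `guarded_write` are hypothetical and are not Exact Online API calls.

```python
from typing import Callable, Dict

# Hypothetical operation names; invoice edits are deliberately absent.
ALLOWED_WRITES = {"add_note", "set_last_reminder_at", "log_sent_email"}


class WriteRefused(Exception):
    """Raised when the agent attempts a write outside the allow-list."""


def guarded_write(op: str, payload: Dict, writers: Dict[str, Callable[[Dict], None]]) -> None:
    """Forward `payload` to the writer for `op` only if `op` is allow-listed."""
    if op not in ALLOWED_WRITES:
        raise WriteRefused(f"agent may not perform {op!r}")
    writers[op](payload)
```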

What changed in the numbers

Eight weeks after go-live, against the baseline month before launch:

DSO dropped from 47 days to 29. The broker's CFO had budgeted 35 as the optimistic case.
The two assistants are now doing one day a week of invoice follow-up instead of five. The other four days, they are on customs paperwork, which is what they were hired for.
Zero recorded incidents of a reminder being sent against an open Freshdesk dispute. Pre-launch, the assistants estimated this happened "five or six times a month". The CFO suspects it was more.
11% of threads still route to the human queue. We are not trying to push that to zero. The ambiguous ones genuinely need a human.

The DSO drop is the headline number. It is not the one Inge cared about most. She cared about the dispute-collision rate. She said in week three:

I have not had a single angry phone call about a reminder this month. I cannot remember the last time that was true.
Finance lead, Antwerp customs broker, May 2026

What we would do differently

Three things, honestly.

We underestimated how much of the work was reconciling Cargonaut references that had been typed by hand into Exact. We thought the long tail would be 2%. It was 9%. The classifier ate two weeks we had not budgeted.

We over-engineered the language-model prompt at first. The initial version had a four-paragraph system prompt with examples. The version in production has a one-paragraph prompt and a regex gate on the output. The shorter prompt produces better Dutch. The model is not the agent. The state machine is the agent.

We should have shipped the audit dashboard in week one, not week five. The first month, every "why did we / why did we not" question turned into a ten-minute log dive. Once the dashboard was up, Inge stopped asking us and started asking the dashboard.

The smallest thing you could do today

Open your last 100 sent reminders. Cross-reference them against your customer-service ticket system. Count how many went out while a dispute ticket was open with the same customer for the same invoice. That number is your floor. It is the bare minimum your chase agent has to fix before it does anything else useful.
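If both systems can export their data, the count is a short script. A sketch, assuming reminders and dispute tickets have already been loaded into the two hypothetical record types below; the field names are illustrative, not a Freshdesk or Exact export format.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass
class Reminder:
    customer_id: str
    invoice_no: str
    sent_at: datetime


@dataclass
class DisputeTicket:
    customer_id: str
    invoice_no: str
    opened_at: datetime
    closed_at: Optional[datetime]  # None while the ticket is still open


def collisions(reminders: Iterable[Reminder], tickets: Iterable[DisputeTicket]) -> int:
    """Count reminders sent while a dispute on the same customer and invoice was open."""
    ticket_list = list(tickets)
    count = 0
    for r in reminders:
        for t in ticket_list:
            same = (t.customer_id, t.invoice_no) == (r.customer_id, r.invoice_no)
            open_then = t.opened_at <= r.sent_at and (t.closed_at is None or r.sent_at < t.closed_at)
            if same and open_then:
                count += 1
                break  # count each reminder once, even if several tickets overlap
    return count
```

With 100 reminders, the nested loop is more than fast enough.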

When we built this invoice-chase agent for the Antwerp broker, the hard part was not the language model. It was the Freshdesk gate and the Cargonaut reference matching. The agent that ships is the one that knows when to keep its mouth shut. If you are looking at the same shape of problem, our approach to AI agents for finance operations mostly follows the patterns above.

Key takeaway

The invoice-chase agent that ships is the one that knows when to keep its mouth shut. Calendar logic in code, language in the model, dispute state from the ticket system.

FAQ

Why not let the model decide when to send reminders?

Because the timing rules are calendar logic, not language. Keeping the state machine in code makes every send auditable, predictable, and cheap. The model only writes the body of the email.

How do you stop the agent from chasing a disputed invoice?

A Freshdesk check runs before every reminder after the first, scoped to the customer and invoice number and cached for fifteen minutes. If a ticket tagged dispute, klacht, or betwist is open, the send is blocked and logged with the ticket ID.

Does it touch the original Cargonaut parser?

No. We read what the existing PHP parser already writes to the MySQL shipments table. Replacing a 19-year-old EDIFACT pipeline is its own project and was not on the table for this engagement.

How fast can a project like this go live?

For a broker of this size with Exact Online, Freshdesk, and a working customs feed, we plan eight weeks: two for read-only integration, two for the gate logic, two for staged rollout, two for the audit dashboard and reply handler.

What was the single biggest driver of the DSO drop?

Consistency. The assistants were chasing about 60% of overdue threads each week. The agent chases 100% of the eligible ones at three business days past due, every time. The model copy barely matters.

Tags: email automation, AI agents, process automation, integrations, case study, operations
