AI agents

Codex agent broke our VAT rounding: a Symfony 6 post-mortem

At 02:14 a finance controller in Breda opened seven-cent invoice mismatches on every B2B order above €10,000. The Codex agent we left running overnight had been working for eight hours.

Jacob Molkenboer· Founder · A Brand New Company· 16 Sept 2024· 9 min

Brass relay switch tripped open beside a folded paper invoice with a green sticky note and red wax pencil on ivory.

At 02:14 on a Tuesday morning, the finance controller at a wholesaler in Breda opened her inbox to a flag from one of their largest B2B customers. The invoice for that day's order, €14,820.34, was off by seven cents against the customer's purchase order. She checked three other invoices over €10,000. All wrong, all by a handful of cents, all on B2B orders.

The Codex agent had been running for just over eight hours.

The ticket we gave it

The client runs a Symfony 6 backoffice that bills around 900 B2B customers a month. Their invoice module was the last legacy island in the codebase: a 1,200-line service class written against Symfony 4 conventions, full of static utility calls, mixed float and integer arithmetic, and a custom VAT calculator that nobody wanted to touch. The ticket was modest. Lift the module to modern Symfony 6 patterns, extract the calculator into its own service, and bring it under the existing test suite. We estimated four engineering hours. We scheduled the orchestrated agent overnight.

The orchestrator we use is the boring shape that has been making the rounds on Hacker News this month: a planner that decomposes the ticket, an implementer (Codex) that writes code one file at a time, a reviewer that reads each diff, and a test runner that gates merges. The implementer is the only agent that touches the working tree. Each step writes its reasoning to a journal so a human can read what happened the next morning.

What Codex actually changed

The morning journal told a coherent story. The agent had inventoried the legacy class, written a refactor plan, extracted a VatCalculator service, replaced the static helpers with constructor-injected dependencies, and rewritten the invoice aggregation to use a Money value object. It even pulled in moneyphp/money as a dependency, which was on our wishlist for that codebase. 1,193 lines rewritten across nine files. Every test green.

The bug lived inside one file: VatCalculator::totalForLines().

Here is what the legacy code did, lightly trimmed:

// Legacy: rounds VAT per line, then sums the rounded line totals.
public static function totalForLines(array $lines): int
{
    $total = 0;
    foreach ($lines as $line) {
        $net = $line['qty'] * $line['unitPriceCents'];
        $vat = (int) round(
            $net * $line['vatRate'] / 100,
            0,
            PHP_ROUND_HALF_UP
        );
        $total += $net + $vat;
    }
    return $total;
}

And here is what Codex wrote:

public function totalForLines(array $lines): Money
{
    $net = new Money(0, new Currency('EUR'));
    foreach ($lines as $line) {
        $net = $net->add(
            Money::EUR($line->qty * $line->unitPriceCents)
        );
    }
    $vat = $net->multiply(
        $this->vatRate / 100,
        Money::ROUND_HALF_EVEN
    );
    return $net->add($vat);
}

Two changes, both defensible in isolation. The agent moved VAT calculation from per-line to per-invoice, and it switched from PHP's default PHP_ROUND_HALF_UP to the Money library's ROUND_HALF_EVEN (banker's rounding). On an invoice with twenty lines and a net above €10,000, the two choices together produce a different cent total than the old code roughly two times in three.

For our client, that meant their ERP, which still computed totals the old way for purchase-order matching, started failing reconciliation on every large B2B invoice from 19:30 onwards.

Why the tests stayed green

The existing suite had thirty-one tests for the invoice module. Twenty-eight were unit tests covering the calculator with line totals between €5 and €450. The other three were integration tests that ran a single small order through the full pipeline. None of them ever computed an invoice large enough for the rounding mode to matter.

The Codex implementer updated the existing tests when it changed the function signature from int to Money. It did not write new tests for the new behaviour, because the planner had not told it to: the ticket said "refactor", not "add coverage". The reviewer agent read the diff, noted the signature change, and confirmed the touched tests still passed.

No agent in the chain had the domain context to ask: is per-line rounding versus per-invoice rounding a customer-facing decision? Because nothing in the repo said it was.

Warning

An agent orchestrator will protect you from syntactic regressions and broken tests. It will not protect you from semantic decisions the original author made silently and never wrote down. Money code is full of those.

How the human review missed it

I want to be honest about this part because it is the bit that matters. The next-morning review took twelve minutes. Our engineer read the planner's summary, scanned the diff, saw a clean refactor with a sensible Money object adoption, and merged. He did exactly what the orchestrator was designed to let him do.

Twelve minutes is what we budget internally for a clean refactor PR with no behavioural notes flagged by the reviewer agent. It works on roughly 95% of the agent's output. The other 5% is the reason posts like this one get written, and it is almost always a semantic decision dressed up as a stylistic one.

The journal flagged the rounding-mode change as a "design choice" in a footnote. It did not flag it as a fiscal one. PHP's own documentation happily presents both modes as valid; nothing on that page tells you the Dutch tax authority cares which one you use, or that your customer's ERP is hardcoded to expect the other.

The 47-minute fix

At 02:51 we reverted the merge, restored the per-line summation, and shipped a hotfix that kept the new Money-typed return value but matched the old arithmetic exactly:

public function totalForLines(array $lines): Money
{
    $total = Money::EUR(0);
    foreach ($lines as $line) {
        $net = Money::EUR(
            $line->qty * $line->unitPriceCents
        );
        $vat = $net->multiply(
            (string) ($line->vatRate / 100),
            Money::ROUND_HALF_UP
        );
        $total = $total->add($net)->add($vat);
    }
    return $total;
}

Then we wrote eleven new tests, all on invoices between €9,800 and €18,000, line counts from three to forty-two, both 21% and 9% Dutch VAT brackets. Two of those tests would have failed against the agent's version. We sent the client a corrected re-invoice script for the seventeen affected orders by lunchtime. Nobody had paid yet; nothing escalated.

We also paused the overnight agent queue for forty-eight hours while we wrote the property tests described below, which delayed three unrelated tickets by a day. That trade was easy to make at 03:30 in the morning.

What we changed in the orchestrator, not the agent

The temptation after this kind of incident is to blame the model. Codex did exactly what we asked it to: refactor a module, keep tests green, write idiomatic code. The failure was upstream of the agent, in the orchestrator and in the ticket.

Four changes went in the week after:

A money-domain checklist attached to every ticket that touches invoicing, pricing, or VAT. It lists the eight things a refactor must preserve: rounding mode, rounding granularity (per-line vs per-invoice), currency precision, tax-inclusive vs tax-exclusive line storage, discount application order, multi-currency conversion timing, negative-amount handling, and credit-note semantics. The planner agent reads it before decomposing. We drafted the first version with the client's finance team in a forty-minute call. They knew every one of those invariants by heart and were quietly surprised nobody had ever asked them to write them down.
A property-test stage for any module the planner tags as fiscal. The test runner generates 500 random invoices between €100 and €100,000 and compares totals to a frozen reference implementation. A diff of one cent fails the build.
A reviewer prompt with auditor framing. The reviewer agent no longer reads diffs as a senior engineer. It reads them as a tax auditor whose job is to find one number that changed. Same model, different brief, materially different output.
A hard rule that the implementer cannot change a return type from a primitive to a value object inside the same PR as a behavioural refactor. Those are now two PRs, always, with a human approval gate in between.

None of these are clever. They are the kind of guardrails a finance team would have asked for if anyone had thought to invite them to the ticket-writing meeting.

What this means if you are running agents against business code

The Hacker News thread on agent orchestration this week made the rounds because the shape it describes is starting to feel real: planner, implementer, reviewer, test gate, journal. We use exactly that shape. It works. It also leaks at the same place every orchestrator leaks: at the seam between syntactic correctness and domain meaning.

If your codebase has any of the following, treat them as fiscal modules and put them behind the heavier checklist before you let an agent near them: invoice and credit-note generation, VAT or sales-tax calculation, payroll, payout splits, refund flows, subscription proration, currency conversion. The agent will refactor them beautifully. It will also, given the chance, silently change one constant that the original author chose for a reason.

Takeaway

Agents fail at the seam between syntactic correctness and domain meaning. Put the domain rules in the ticket, not in the reviewer's head.

The five-minute audit you can run today

Open the file in your codebase that calculates invoice totals. Search it for round(, floor(, ceil(, ->multiply(, and any rounding-mode constant. For each hit, write one sentence in a code comment above it explaining which choice the original author made and why. If you cannot answer the "why" for any of them, you have just found the line your agent will quietly rewrite next month.

When we rebuilt this client's invoice module properly four weeks after the incident, the thing we wanted most was a place to put those sentences where the agent would actually read them. We ended up adding a docs/money-invariants.md file that the planner is required to cite in its decomposition. If you want help putting that kind of guardrail around your own AI agents work, that is the conversation we are usually having on the first call.

Key takeaway

Agents fail at the seam between syntactic correctness and domain meaning. Put the domain rules in the ticket, not in the reviewer's head.

FAQ

What is an orchestrator in agent engineering?

A small framework around the model. Typically a planner, an implementer, a reviewer, and a test gate. The agent writes code; the orchestrator decides what it sees, when, and what gets merged.

Why does VAT rounding only break on large invoices?

On small line totals the per-line vs per-invoice rounding difference is usually one cent and sums to near zero. Above €10,000 the drift can reach several cents and break ERP reconciliation.

Is per-line or per-invoice VAT rounding correct under Dutch tax law?

Both are permitted. The Belastingdienst requires consistency, not a specific method. The risk is changing methods silently mid-stream, which is exactly what happened here.

Could more tests have caught this before merge?

Yes. The suite had thirty-one tests but none on invoices large enough to expose the drift. Property-based tests on randomised amounts between €100 and €100,000 would have failed on the first run.

Should we stop using coding agents on legacy code?

No. The agent did the refactor faster and cleaner than a human would have. The lesson is to put domain invariants in the ticket and in the reviewer brief, not to remove the agent.

ai agentsautomationphplegacy sitescase studyoperations

Building something?

Start a project