Invoice OCR field guide: fifteen quirks that bite agents

A bookkeeper finds nine invoices booked as cover sheets only. The intake agent is fine. The OCR export is not. Here are the fifteen quirks that cause it.

Jacob Molkenboer · Founder · A Brand New Company · 15 Jun 2026 · 9 min

It is a Tuesday morning at a mid-sized Amsterdam accountancy firm. The bookkeeper opens nine invoices the intake agent posted to Exact over the weekend. Eight look fine. The ninth is a cover sheet from KPN, with a polite note about your bill being attached. Amount: zero. Line items: none. The actual invoice sits on pages two and three of the same PDF.

The agent did not hallucinate. Klippa returned a clean JSON object for page one and stopped, because the cover page was classified as a separate document inside that PDF and pages two and three got assigned to a sibling object the agent never asked for.

This is the worst kind of failure. It does not throw. It does not warn. It posts a junk row to the ledger, and you find out in month-end close when a finance lead asks why the phone bill says zero.

We have rolled out invoice-intake agents on top of Klippa, Basecone, and DigiOffice at accountancy firms in Utrecht, Eindhoven, and across the border in Antwerp. Fourteen agents are live in production at the time of writing. Below are the fifteen export quirks that bite every rollout, ranked by how silently they corrupt the ledger.

The ordering matters. Quirks that throw an error are fine; you catch them and queue them. Quirks that quietly drop a field or truncate a document are the ones that turn an OCR pipeline into a slow leak nobody notices for a quarter.

Tier 1: silent truncation of multi-page PDFs

Three of the fifteen sit at the top because they cost real money before anyone sees the symptom.

1. Klippa's multi-document split. When you upload a PDF that Klippa's classifier thinks contains more than one document (cover letter plus invoice, statement plus invoice, two invoices stapled together), the default API response wraps each in a separate document object inside documents[]. If your agent reads only documents[0], you booked the cover sheet and ignored the invoice. The fix is to iterate. The trap is that the field exists at all and most starter code in their dashboard reads index zero.

2. Basecone's max-pages-per-stream cap. Standard-tier accounts cap streaming OCR at 30 pages per document. Anything beyond is parked in a review queue and never streams to the agent. We have seen sixty-page consolidated supplier statements vanish into that queue for three weeks because the only signal was a tab in the UI that nobody opens.

3. DigiOffice ZIP exports use Windows backslashes in the manifest paths. On a Linux extraction (which is most agent runtimes), the backslashes are not treated as separators, so the file is created in the unzip root with the literal name invoices\2026\06\KPN-inv.pdf instead of being nested. Your agent reads the manifest, looks for invoices/2026/06/KPN-inv.pdf, and skips it. The PDF is right there, under a different name, in a different directory.
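
The first and third Tier 1 fixes can be sketched in a few lines of Python. This is a minimal sketch, not vendor code: the documents[] key comes from the description above, while the per-document page_count key and the function names pages_covered and extract_normalised are assumptions made for illustration.

```python
import zipfile
from pathlib import Path, PureWindowsPath


def pages_covered(response: dict, source_page_count: int) -> bool:
    """True if the pages across ALL documents[] entries add up to the source PDF.

    Assumes each document object carries a page_count (hypothetical key).
    """
    total = sum(doc.get("page_count", 0) for doc in response.get("documents", []))
    return total >= source_page_count


def extract_normalised(zip_source, dest_dir) -> list:
    """Extract a ZIP, treating backslashes in member names as path separators."""
    dest = Path(dest_dir).resolve()
    written = []
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.filename.endswith(("/", "\\")):
                continue  # directory entry, nothing to write
            # PureWindowsPath accepts both separators; as_posix() emits "/".
            relative = PureWindowsPath(info.filename).as_posix()
            target = (dest / relative).resolve()
            if not target.is_relative_to(dest):  # Python 3.9+
                raise ValueError(f"unsafe path in archive: {info.filename!r}")
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(archive.read(info))
            written.append(target)
    return written
```

Writing each member by hand, rather than calling extractall(), is what lets the backslashes become real directories. The resolve check also rejects members that would land outside the destination.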

Tier 2: the silent field drop

These do not lose a page. They lose a column. The rest of the row passes every check.

4. Basecone drops the IBAN when the BIC field is empty. This is the headline one. Their export serialiser treats IBAN and BIC as a payment pair; if BIC is blank (very common for SEPA payments inside NL, where the SEPA IBAN-only rule makes BIC optional), both fields collapse to null in the JSON. The data is in their UI. It just is not in the export. We caught this by reconciling against the raw PDF text and finding 7% of supplier rows missing the IBAN despite OCR having extracted it correctly.

Warning

Test the Basecone export against a SEPA invoice with no BIC before you let it touch the ledger. The dropped IBAN is invisible in the JSON and the field is filled in their web app, so a manual spot check will not catch it.

5. Klippa truncates supplier names at 35 characters in the supplier.name field, but the full string is in supplier.raw_name. The starter code reads the first one. Your chart of accounts then gets fuzzy-matched against the truncated string, and new supplier records get created where existing ones should have matched. After a year you have three rows for the same telecom company, none of which reconcile.

6. DigiOffice exports VAT rates as strings like "21%" with a literal percent sign, sometimes with a trailing space ("21% "). If your agent casts to float without stripping, every other invoice fails the VAT total cross-check.
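
Quirks 5 and 6 are cheap to defend against before any matching or arithmetic happens. The sketch below assumes the field names described above (supplier.name, supplier.raw_name) and VAT strings like "21% "; the helper names are hypothetical, and the decimal-comma handling is an extra assumption about Dutch-formatted rates, not something the export is known to emit.

```python
from decimal import Decimal, InvalidOperation


def parse_vat_rate(raw: str) -> Decimal:
    """Turn '21%' or '21% ' into Decimal('21').

    Also accepts a decimal comma ('9,0%'), which is an assumption.
    """
    cleaned = raw.strip().rstrip("%").strip().replace(",", ".")
    try:
        return Decimal(cleaned)
    except InvalidOperation as exc:
        raise ValueError(f"unparseable VAT rate: {raw!r}") from exc


def supplier_match_name(supplier: dict) -> str:
    """Prefer the untruncated raw_name for fuzzy matching; fall back to name."""
    return (supplier.get("raw_name") or supplier.get("name") or "").strip()
```

Raising on an unparseable rate, instead of returning zero, keeps the failure loud so the row can be queued rather than booked.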

Tier 3: schema drift on the edge paths

These quirks do not fire on the happy path. They fire when an invoice hits a side branch in the vendor's pipeline, and the schema changes underneath you.

7. Klippa changes the JSON shape on manual review. The line_items field, normally an array of objects, becomes an object keyed by line index when a reviewer adds a comment. Same endpoint, same auth, different shape. If your agent uses a strict schema validator, it will throw. If it uses a permissive one, it iterates the object and gets keys instead of items, and the line totals end up empty.

8. Basecone's webhook fires twice for invoices over ~3MB. The first fire is status: "received", the second is status: "extracted". The gap is 200ms to 4s depending on load. There is no ordering guarantee. If your agent processes on the first fire, the extracted fields are empty. If it processes on both, you double-book. Polling is safer than the webhook for anything above the size threshold.

9. DigiOffice retries failed OCR runs on the backend but does not increment a version field. You can GET the same document_id twice and receive different content. The only signal is the last_modified_at timestamp, which most agents ignore on a GET. Track it.
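
Two of these can be absorbed with small normalising helpers. The sketch below is an illustration under assumptions: it takes the two line_items shapes described in quirk 7 at face value and assumes the object's keys are integer-like line indexes. It also assumes the webhook payload carries a document_id next to status, which the text above does not confirm for Basecone.

```python
def normalise_line_items(line_items) -> list:
    """Return line items as a list, whichever shape the export used."""
    if line_items is None:
        return []
    if isinstance(line_items, list):
        return list(line_items)
    if isinstance(line_items, dict):
        # Keyed by line index (assumed integer-like); sort numerically so "10" follows "9".
        return [line_items[key] for key in sorted(line_items, key=int)]
    raise TypeError(f"unexpected line_items type: {type(line_items).__name__}")


def should_process(event: dict, already_booked: set) -> bool:
    """Act only on the 'extracted' event, and only once per document."""
    if event.get("status") != "extracted":
        return False
    doc_id = event["document_id"]  # hypothetical payload key
    if doc_id in already_booked:
        return False
    already_booked.add(doc_id)
    return True
```

In production the already_booked set would need to live in durable storage; an in-memory set is only enough to show the idea.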

Tier 4: encoding, locale, and other small cuts

The remaining six are the small annoyances that produce off-by-one errors in financial reports.

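10. DigiOffice CSV exports use CP1252, not UTF-8. Open one with Python's default open() on Linux, where the default encoding is usually UTF-8, and the first euro sign (byte 0x80 in CP1252, which is not valid UTF-8) raises a UnicodeDecodeError; decode leniently instead and every euro sign turns into a replacement character. Specify encoding="cp1252" and the world stabilises. This is documented but you have to find it. See the CP1252 reference for the byte-for-byte differences with UTF-8 in the high range.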

11. Basecone exports dates as DD-MM-YYYY for NL accounts and MM/DD/YYYY for accounts that have ever had a US user added, even temporarily. The user gets removed. The locale flip stays. We have only seen this once but the firm spent two days on it.

12. Klippa returns per-field confidence but does not flag when an entire page is rotated 90 degrees and the OCR is mostly noise. Confidence on individual fields can sit at 0.7 to 0.9 while the document is unreadable. Average the confidences across the page; if the mean is below 0.85, look at the original.

13. DigiOffice CSV uses semicolon as the delimiter (Excel-NL default) but does not escape semicolons inside quoted description fields. Your parser will be off-by-one for any row with a description containing a literal ;. Supplier descriptions love semicolons.

14. Klippa's total_amount includes VAT. The net is in total_amount_excl. Both are returned. Pick the wrong one and you book gross as net for the year. We have seen this in production. The reconciliation only flagged it because the VAT-return automation downstream tried to recompute VAT on a number that already contained it.

15. Basecone flips the sign on credit-note VAT lines when the VAT code is one of three legacy codes (NL-NUL, NL-VRIJ, NL-NIH). The grand total reconciles only because the VAT line is zero anyway. The moment you have a non-zero VAT on a credit note (rare in NL but real for some hospitality clients), the export is silently off.
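
Quirks 10, 12, and 13 lend themselves to a short defensive reader. This is a sketch under assumptions: the function names are hypothetical, the CSV is assumed to have a header row, and the confidence values are assumed to arrive as plain floats per field.

```python
import csv
from statistics import mean


def read_digioffice_csv(path) -> list:
    """Read a semicolon-delimited CP1252 export into a list of dicts."""
    with open(path, newline="", encoding="cp1252") as handle:
        return list(csv.DictReader(handle, delimiter=";"))


def rows_with_extra_columns(rows: list) -> list:
    """Rows that split into more fields than the header has.

    csv.DictReader stores overflow fields under the key None, which is
    a cheap signal that a stray semicolon shifted the columns.
    """
    return [row for row in rows if None in row]


def page_needs_review(field_confidences, threshold: float = 0.85) -> bool:
    """Flag a page when the mean field confidence falls below the threshold."""
    values = list(field_confidences)
    return not values or mean(values) < threshold
```

Flagging shifted rows for a human is deliberately conservative: guessing which semicolon belonged to the description is how amounts end up in the wrong column.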

The five-minute audit you can run today

Pull the last sixty days of invoices through your pipeline and run three checks per row. You do not need to fix anything yet. You need to know what your hit rate is.

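import re

# Dutch IBAN: "NL", 2 check digits, 4-letter bank code, 10-digit account number.
NL_IBAN = re.compile(r"NL\d{2}[A-Z]{4}\d{10}")


def _to_float(value):
    # Exports sometimes carry a stray "%" or trailing space on numeric strings.
    return float(str(value).strip().rstrip("% "))


def audit_invoice(extracted, source_pdf_text, source_page_count):
    """Return a list of issue tags for one extracted invoice."""
    issues = []

    # Tier 1: did we read all pages? A missing page_count counts as truncated.
    if extracted.get("page_count", 0) < source_page_count:
        issues.append("truncated")

    # Tier 2: critical fields present in source but missing in export?
    # Printed IBANs are usually grouped in fours, so drop whitespace before matching.
    iban_in_source = NL_IBAN.search(re.sub(r"\s+", "", source_pdf_text))
    if iban_in_source and not extracted.get("iban"):
        issues.append("iban_dropped")

    # Tier 4: does the total reconcile (net + VAT == gross, within 2 cents)?
    try:
        net = _to_float(extracted["total_amount_excl"])
        vat = _to_float(extracted["vat_amount"])
        gross = _to_float(extracted["total_amount"])
        if abs(net + vat - gross) > 0.02:
            issues.append("reconciliation_break")
    except (KeyError, ValueError):
        issues.append("unparseable_amounts")

    return issues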

Run it over a sample of one hundred invoices: three checks per row. If your truncated or iban_dropped rate is above 2%, you have a Tier 1 or Tier 2 quirk in the wild and a number that finance will care about. If your reconciliation_break rate is above 5%, you are probably hitting a Tier 4 amount-field confusion.
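
Turning per-invoice issue lists into rates is a few lines more. The sketch below assumes each audited invoice yields a list of issue tags as in the audit above, and that the 2% and 5% thresholds apply to each tag separately, which is one reading of the rule of thumb. The names issue_rates and breaches are hypothetical.

```python
from collections import Counter

# Thresholds from the rule of thumb above, applied per tag (an assumption).
THRESHOLDS = {"truncated": 0.02, "iban_dropped": 0.02, "reconciliation_break": 0.05}


def issue_rates(results: list) -> dict:
    """Share of invoices carrying each issue tag; results is one tag list per invoice."""
    total = len(results)
    if total == 0:
        return {}
    counts = Counter(tag for issues in results for tag in set(issues))
    return {tag: count / total for tag, count in counts.items()}


def breaches(rates: dict) -> list:
    """Tags whose rate exceeds its threshold, sorted for stable output."""
    return sorted(tag for tag, limit in THRESHOLDS.items() if rates.get(tag, 0.0) > limit)
```

For example, three truncated invoices in a sample of one hundred give a rate of 0.03 and a single breach on truncated.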

Why these quirks exist in the first place

The pattern across all fifteen is the same. Each vendor built an export schema around a UI workflow, not around an automated reader. The manual-review array becomes an object because that is how the UI renders comments. The IBAN drops because the UI form treats IBAN and BIC as a single widget. The cover sheet splits because the UI lets a reviewer reclassify each page.

None of this is fixable on the vendor's side without breaking their customers' existing exports. The fix lives between the export and the agent: a small reconciliation layer with one job. Prove that what the agent is about to write matches what is on the source PDF, and queue anything that does not.
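
As a rough illustration of that layer's shape, the sketch below gates each write behind a list of checks and returns either "post" or "review". Every name here is hypothetical, including the invoice_number and page_count keys; real checks would be whatever your own audit has shown to matter.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# A check returns a reason string when it fails, or None when it passes.
Check = Callable[[dict, str], Optional[str]]


@dataclass
class Decision:
    action: str  # "post" or "review"
    reasons: list = field(default_factory=list)


def invoice_number_in_source(extracted: dict, source_text: str) -> Optional[str]:
    """Fail when the extracted invoice number is absent from the PDF text."""
    number = str(extracted.get("invoice_number") or "").strip()
    if not number or number not in source_text:
        return "invoice_number_not_in_source"
    return None


def page_count_matches(expected_pages: int) -> Check:
    """Build a check that fails when fewer pages were read than the PDF has."""
    def check(extracted: dict, _source_text: str) -> Optional[str]:
        if extracted.get("page_count", 0) < expected_pages:
            return "truncated"
        return None
    return check


def gate(extracted: dict, source_text: str, checks: list) -> Decision:
    """Run every check; anything that fails sends the invoice to review."""
    reasons = [reason for check in checks if (reason := check(extracted, source_text))]
    return Decision("review" if reasons else "post", reasons)
```

Keeping the checks as plain functions means a new quirk becomes one more entry in the list, and the agent only ever sees a decision.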

When we built the invoice-intake agent for a Utrecht accountancy firm last quarter, we ran into the Basecone IBAN drop on 7% of supplier rows. We added a single regex pass over the source PDF text after extraction, compared the result against the export, and routed deltas to a human queue. Five hours of work caught a six-figure annual booking error. That kind of glue is most of what our AI agents work looks like in production.

Today, pull one suspicious invoice from last week. Diff the export JSON against the raw PDF text by hand. You will know within ten minutes whether you have a Tier 1 problem or a Tier 4 problem, and that tells you which week of work to plan for.

Key takeaway

Reconcile your OCR vendor's export against the raw PDF text. Silent truncation and silent field drops kill more invoice-intake agents than bad OCR ever did.

FAQ

Which OCR vendor handles multi-page PDFs best out of the box?

None of them are bug-free. Klippa has the cleanest schema but the split-document trap is real. The fix is a page-count reconciliation step, not a vendor switch.

How do we detect the Basecone IBAN drop without reading every PDF?

Regex the source PDF text for the Dutch IBAN pattern. If a match exists and the export's iban field is null, flag the row to a human queue.

Is this list specific to Dutch accountancy or does it apply to other markets?

Most are vendor behaviours, not market quirks. Klippa is Dutch but used internationally. DigiOffice's CP1252 trap hits any runtime that assumes UTF-8, wherever it is hosted.

Why not just call the vendor and report the bugs?

We have. A few have been fixed over the years. Most are documented as 'by design' because changing them would break existing customer exports. Build the reconciliation layer.

How often should we run the audit script in production?

Daily for the first month after going live, then weekly. Vendor schemas change without notice when they ship features, and the first signal is usually a reconciliation drift, not an error.
