Durable invoice agents: n8n vs Make vs Temporal stack

We costed three durable-execution runtimes for a factuur-agent at 4,600 weekly inkoopfacturen against the seven-year belastingdienst bewaarplicht. The cheapest answer wasn't the right one.

Jacob Molkenboer · Founder · A Brand New Company · 18 Jun 2026 · 9 min

The financial controller of a 20-person sheet-metal works in Apeldoorn opens her shared mailbox on a Monday morning and sees the week's 4,600 inkoopfacturen waiting for her. Two are duplicates. Eleven are denominated in CHF. One is a credit note disguised as an invoice. Forty-three are PDFs that her current OCR rejects on sight. She wants a factuur-agent. She wants it to read, validate, deduplicate, code to the right grootboekrekening, and push the result to Twinfield. She also wants a step-by-step replay of every decision the agent made, retained for seven years, in case the belastingdienst comes calling.

That last requirement is what killed three of our shortlist tools before we wrote a line of production code.

This is a comparison of the three runtimes we costed for that build: n8n self-hosted, Make.com on a Teams plan, and a hand-rolled Temporal + TypeScript stack. The criteria were narrow and unforgiving — per-workflow cost at 20,000 invoices a month, replay-defensible step history for the seven-year bewaarplicht, and who gets paged when Twinfield rotates its OAuth refresh-token at 02:00 on a Sunday.

What the agent actually does

Before runtimes, the work. One invoice end-to-end:

Pull from IMAP or the Twinfield UBL inbox.
Classify (factuur / creditnota / aanmaning / spam).
OCR plus structured extraction with a vision model, fallback to a rule extractor for known suppliers.
Deduplicate against a 90-day Postgres window keyed on (supplier, invoice_number).
Validate VAT line items, foreign-currency conversion at invoice date, BTW-code, grootboek mapping.
Push to Twinfield via the SOAP API, capture the document GUID.
If approval is required (above €2,500 or a new supplier), route to Slack with an Approve / Reject card.
Persist the full decision trail.

Those eight stages break down into, on average, eleven discrete steps per invoice, with two external APIs that can fail (Twinfield, the LLM), one human-in-the-loop branch, and a hard requirement that nothing gets lost between steps. The runtime layer's job is to keep that pipeline durable across crashes, restarts, deploys, and rate-limit storms.
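Two of the stages listed above are simple enough to sketch as pure functions: the deduplication key and the approval rule. This is a minimal sketch, not the production code. The field names, the trim-and-upper-case normalisation, and the in-memory `seen` map standing in for the Postgres lookup are all assumptions for illustration. Only the (supplier, invoice_number) key, the 90-day window, and the €2,500 / new-supplier rule come from the pipeline as described.

```typescript
// Hypothetical shape of an extracted invoice; field names are assumptions.
interface ExtractedInvoice {
  supplierId: string
  invoiceNumber: string
  invoiceDate: Date
  totalEur: number
  isNewSupplier: boolean
}

const DEDUP_WINDOW_DAYS = 90
const APPROVAL_THRESHOLD_EUR = 2500
const MS_PER_DAY = 24 * 60 * 60 * 1000

// Build the (supplier, invoice_number) key. Trimming and upper-casing are
// assumed normalisation steps so that "18293/a" and "18293/A" collide.
function dedupKey(inv: ExtractedInvoice): string {
  return `${inv.supplierId.trim().toUpperCase()}|${inv.invoiceNumber.trim().toUpperCase()}`
}

// True when an invoice with the same key was seen within the window.
// `seen` stands in for the Postgres lookup: key -> date of the earlier invoice.
function isDuplicate(inv: ExtractedInvoice, seen: Map<string, Date>): boolean {
  const previous = seen.get(dedupKey(inv))
  if (!previous) return false
  const ageDays = Math.abs(inv.invoiceDate.getTime() - previous.getTime()) / MS_PER_DAY
  return ageDays <= DEDUP_WINDOW_DAYS
}

// Approval is required above the threshold or for a new supplier.
function requiresApproval(inv: ExtractedInvoice): boolean {
  return inv.totalEur > APPROVAL_THRESHOLD_EUR || inv.isNewSupplier
}
```

Keeping both rules free of IO makes them cheap to unit-test, and in a durable runtime it means they can be re-evaluated during a replay without touching the database.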

n8n self-hosted: cheap until it isn't

We started here because it's where most teams start. n8n is a node-based workflow engine — self-host it on a €40/mo VPS, drag your IMAP node onto the canvas, wire it to an LLM node, then to a Postgres node, then to an HTTP node for Twinfield.

For the happy path, it works. We had a proof-of-concept eating 200 invoices in an afternoon within two days.

The trouble started in three places.

Execution data retention. n8n writes every execution to its own database. You can configure pruning and an age cap (see the docs), but if your bewaarplicht is seven years, you cannot prune. At 20,000 invoices × 11 nodes per execution × tens of kilobytes of node IO each (about 50 KB on average), you are writing roughly 11 GB per month of execution data that you can never delete. After seven years you are sitting on close to a terabyte of operational database, and your Postgres backups are now the rate-limiting step in every deploy.
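The retention arithmetic is easy to reproduce. The sketch below assumes an average of 50 KB of IO per node, which is the value that reproduces the 11 GB figure; substitute your own measurement. Units are decimal (1 GB = 1,000,000 KB).

```typescript
// Volume and node count come from the text; the per-node size is an assumption.
const INVOICES_PER_MONTH = 20000
const NODES_PER_EXECUTION = 11
const ASSUMED_KB_PER_NODE = 50 // assumption: a midpoint for "tens of kilobytes"
const RETENTION_MONTHS = 7 * 12

// Execution data written per month, in decimal gigabytes.
function executionDataGbPerMonth(invoices: number, nodes: number, kbPerNode: number): number {
  return (invoices * nodes * kbPerNode) / 1_000_000
}

const gbPerMonth = executionDataGbPerMonth(INVOICES_PER_MONTH, NODES_PER_EXECUTION, ASSUMED_KB_PER_NODE)
const gbAfterSevenYears = gbPerMonth * RETENTION_MONTHS

console.log(`${gbPerMonth} GB per month, ${gbAfterSevenYears} GB after seven years`)
// prints "11 GB per month, 924 GB after seven years"
```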

Replay is not deterministic. When the belastingdienst asks why invoice 18293/A was coded to grootboek 7000 in March 2027, you cannot re-run that node in isolation. n8n stores the inputs and outputs of each node, but the code that runs the replay is whatever code is on disk today. If you upgraded the workflow between then and now, what you replay is not what ran. You can persist the JSON of the workflow at execution time as a side-effect, but you have to write that yourself, and you have to write the loader that reconstructs it.
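A minimal sketch of that side-effect, assuming the workflow definition is available as parsed JSON at execution time. The record shape and function names are hypothetical and are not part of n8n's API. The idea is to store a canonical copy of the workflow JSON next to each execution, so that a later loader can tell whether the workflow on disk is still the one that ran.

```typescript
// Hypothetical record stored alongside each execution; names are assumptions.
interface WorkflowSnapshot {
  executionId: string
  capturedAt: string // ISO timestamp
  workflowJson: string // canonical JSON of the workflow as it ran
}

// Serialise with sorted object keys so the same definition always yields
// the same string. Input is assumed to be plain parsed JSON.
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map((item) => canonicalJson(item)).join(',')}]`
  }
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>
    const body = Object.keys(obj)
      .filter((key) => obj[key] !== undefined)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalJson(obj[key])}`)
    return `{${body.join(',')}}`
  }
  return JSON.stringify(value)
}

// Capture side: call this once per execution, before the first node runs.
function snapshotWorkflow(executionId: string, workflow: unknown, now: Date): WorkflowSnapshot {
  return { executionId, capturedAt: now.toISOString(), workflowJson: canonicalJson(workflow) }
}

// Loader side: is the workflow on disk today the one that actually ran?
function ranSameWorkflow(snapshot: WorkflowSnapshot, current: unknown): boolean {
  return snapshot.workflowJson === canonicalJson(current)
}
```

This only pins the workflow definition. Pinning the n8n version and any custom node code that executed it is a separate problem the sketch does not address.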

The 02:00 page. n8n's built-in retry is per-node. When Twinfield rotates the OAuth refresh-token at 02:00 on a Sunday, every in-flight execution fails the Twinfield node. n8n retries three times, fails, and the workflow ends in error. There is no global "pause this branch and resume when the token is back" primitive. Someone has to log in to the editor, replay failed executions one by one, and hope no invoice was lost between IMAP delete-on-read and the failed write.

Per-workflow cost was low: about €0.004 per invoice including the VPS and the Postgres tier. The on-call cost was not low. Sunday morning was now part of the job description.

Make.com: the operation-counter problem

Make sells reliability as a feature, and for small flows it delivers. The model is different from n8n: every module inside a scenario costs you one operation, and you buy operations in monthly buckets.

We modelled the same eleven-step pipeline. At 20,000 invoices a month, that is roughly 220,000 operations baseline, plus retries, plus the Slack approval branch on the ~15% of invoices that hit it, plus the deduplication lookup. Closer to 320,000 operations a month at steady state, with peaks above that during invoice-heavy weeks.

To cover 350,000 operations on a Teams plan we were looking at the higher tier with a per-invoice cost around €0.03, roughly seven to eight times the n8n number. Make's pricing rewards small scenarios and punishes high-volume ones.

The retention story is similar to n8n: execution logs are retained for the plan window (30 to 90 days on Teams), and seven-year defensibility is not part of the product. You can export, but you are exporting JSON to your own bucket, which means you might as well have written the agent on a runtime you control.

The 02:00 problem is partially solved by Make's incremental retry and queue, but the same hole exists: if the OAuth refresh-token rotation needs a one-time interactive step (Twinfield occasionally does, for new scopes), the scenario hangs rather than fails, and Make pages no one — because Make does not know that this particular HTTP 401 is special.

Warning

If you are buying a hosted workflow runtime to satisfy the seven-year Dutch bewaarplicht, read the vendor's data-retention SLA before signing. Most cap operational logs at 30 to 90 days. The retention you actually need has to live in your warehouse, not theirs.

Temporal + TypeScript: the runtime that is also the archive

Temporal is a different shape of tool. It is not a no-code canvas. It is a durable-execution engine that runs your TypeScript (or Go, Java, Python) workflow code and persists every step to its event history. Event history is the central abstraction: every input, every output, every retry, every signal, every timer. The workflow code has to be deterministic, which is what makes the event history sufficient to replay the entire execution, byte for byte, against the same code version that ran originally.

The runtime hands you the bewaarplicht artefact for free. Pin the workflow code version into the event, store the event history for seven years (it is a flat protobuf log, compresses well — we measured roughly 1.4 KB per invoice after gzip), and the belastingdienst question becomes a replay command.

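import { Worker, NativeConnection } from '@temporalio/worker'
import * as activities from './activities'

// Wrapped in a function: require.resolve implies CommonJS, which has no top-level await.
async function run() {
  // Connect to the Temporal frontend service.
  const connection = await NativeConnection.connect({ address: 'temporal:7233' })
  const worker = await Worker.create({
    connection,
    namespace: 'factuur',
    taskQueue: 'invoices',
    workflowsPath: require.resolve('./workflows'),
    activities,
  })
  // Poll the task queue until the worker is shut down.
  await worker.run()
}

run().catch((err) => {
  console.error(err)
  process.exit(1)
})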

The workflow itself reads almost like the prose version of the pipeline.

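import { proxyActivities, condition, setHandler } from '@temporalio/workflow'
import { approvalSignal } from './signals'
import type * as acts from './activities'
import type { RawInvoice } from './types' // assumed location of the RawInvoice type

type Decision = 'approve' | 'reject'

// Activities run outside the deterministic workflow sandbox; Temporal applies this retry policy.
const a = proxyActivities<typeof acts>({
  startToCloseTimeout: '2 minutes',
  retry: { maximumAttempts: 5, initialInterval: '4s' },
})

export async function processInvoice(raw: RawInvoice) {
  // Held in an object so TypeScript does not narrow the value to null after the await.
  const approval: { decision: Decision | null } = { decision: null }
  setHandler(approvalSignal, (d) => { approval.decision = d })

  const cls = await a.classify(raw)
  if (cls.type !== 'invoice') return a.archive(raw, cls.type)

  const extracted = await a.extract(raw)
  if (await a.isDuplicate(extracted)) return a.archive(raw, 'duplicate')

  const coded = await a.codeToGrootboek(extracted)
  if (coded.requiresApproval) {
    await a.requestSlackApproval(coded)
    // condition() resolves to false when the timeout elapses without a signal.
    const answered = await condition(() => approval.decision !== null, '7 days')
    // Assumed handling: an unanswered request must not fall through to Twinfield.
    if (!answered) return a.archive(raw, 'approval-timeout')
    if (approval.decision === 'reject') return a.archive(raw, 'rejected')
  }

  return a.pushToTwinfield(coded)
}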

The OAuth refresh-token at 02:00 problem becomes pleasant. The Twinfield activity catches the 401, raises a non-retryable application error tagged TWINFIELD_AUTH_EXPIRED, and a separate workflow — a long-running tokenSentinel — wakes up, runs the refresh, and signals every paused invoice workflow to retry. No one gets paged. Invoices that were mid-flight at 02:00 finish themselves at 02:04.
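The workflow listing above does not show this handling, so here is the shape of it as a library-free sketch. It is not the production code. The helper names are hypothetical, and the `tokenRefreshed` callback stands in for what would be a signal handler plus `condition()` inside a Temporal workflow. On the activity side, the Temporal TypeScript SDK equivalent of the tagged error would be an `ApplicationFailure` with its `type` and `nonRetryable` fields set.

```typescript
// Error tag from the text; everything else here is an illustrative assumption.
const TWINFIELD_AUTH_EXPIRED = 'TWINFIELD_AUTH_EXPIRED'

type TaggedError = Error & { type: string; nonRetryable: boolean }

function tagged(message: string, type: string, nonRetryable: boolean): TaggedError {
  return Object.assign(new Error(message), { type, nonRetryable })
}

// Activity side: a 401 becomes a non-retryable, tagged failure so the
// per-activity retry policy does not burn attempts on a dead token.
// Any other status is left to the normal retry policy (null here).
function classifyTwinfieldStatus(status: number): TaggedError | null {
  if (status === 401) return tagged('Twinfield token expired', TWINFIELD_AUTH_EXPIRED, true)
  return null
}

function isAuthExpired(err: unknown): err is TaggedError {
  return err instanceof Error && (err as Partial<TaggedError>).type === TWINFIELD_AUTH_EXPIRED
}

// Workflow side: on an auth failure, wait until the sentinel reports a fresh
// token, then try the push again. Other errors propagate unchanged.
async function pushWithAuthGate<T>(
  push: () => Promise<T>,
  tokenRefreshed: () => Promise<void>,
  maxRounds = 3,
): Promise<T> {
  for (let round = 1; ; round++) {
    try {
      return await push()
    } catch (err) {
      if (!isAuthExpired(err) || round >= maxRounds) throw err
      await tokenRefreshed()
    }
  }
}
```

The cap on rounds is a design choice in this sketch: if the token keeps expiring straight after a refresh, something is wrong at code level and the failure should surface rather than loop.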

Cost. Self-hosted Temporal on a three-node Postgres-backed cluster runs us about €180/mo of infrastructure at this scale. Temporal Cloud, which we used for the first six months while we hardened the cluster, bills by Action; their pricing at 20,000 invoices a month came in under $40 including a small base fee. Per-workflow cost is about €0.009 on owned infra and €0.0018 on Cloud. Both are cheaper than Make. Self-hosted is roughly twice the n8n number and Cloud is below it, and both come with the seven-year archive baked into the runtime instead of bolted on.

Per-workflow cost, side by side

At 20,000 invoices a month, all-in:

n8n self-hosted: ~€80/mo runtime, ~€0.004 per invoice, with an unsolved seven-year retention problem and a Sunday-morning pager.
Make Teams (higher tier): ~€600/mo runtime, ~€0.03 per invoice, with a 90-day log retention you have to mirror out.
Temporal self-hosted: ~€180/mo runtime, ~€0.009 per invoice, seven-year event history as a first-class feature.
Temporal Cloud: ~€35/mo runtime, ~€0.0018 per invoice, same archive guarantees, no cluster to operate.

Among the runtimes you host yourself, the cheapest per-invoice number is n8n. The cheapest total cost of ownership, once you price the on-call hours and the side-channel retention work, is Temporal. We have run that spreadsheet three different ways for three different clients and the answer comes out the same whenever the bewaarplicht is in play.
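One way to run that spreadsheet is as a break-even: how many on-call hours per month does it take before the cheaper runtime stops being cheaper? The sketch below uses the monthly runtime figures for n8n and self-hosted Temporal from the list above. The hourly rate is a hypothetical placeholder, not a figure from this project, and the calculation assumes the pricier runtime needs no on-call time at all.

```typescript
// Monthly runtime figures from the comparison above.
const N8N_RUNTIME_EUR = 80
const TEMPORAL_SELF_HOSTED_EUR = 180

// Hypothetical rate for illustration only; substitute your own.
const ASSUMED_HOURLY_RATE_EUR = 100

// Total monthly cost: runtime plus on-call hours priced at an hourly rate.
function monthlyTotalEur(runtimeEur: number, onCallHours: number, hourlyRateEur: number): number {
  return runtimeEur + onCallHours * hourlyRateEur
}

// On-call hours per month at which the cheaper runtime's total matches the
// pricier runtime's total, assuming the pricier one needs no on-call time.
function breakEvenOnCallHours(cheapRuntimeEur: number, pricierRuntimeEur: number, hourlyRateEur: number): number {
  return (pricierRuntimeEur - cheapRuntimeEur) / hourlyRateEur
}

const breakEvenHours = breakEvenOnCallHours(N8N_RUNTIME_EUR, TEMPORAL_SELF_HOSTED_EUR, ASSUMED_HOURLY_RATE_EUR)
console.log(`break-even at ${breakEvenHours} on-call hour(s) per month`)
// prints "break-even at 1 on-call hour(s) per month"
```

At that assumed rate, a single on-call hour a month already erases the runtime price gap, before any retention work is priced in.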

Who patches the workflow at 02:00

This is the unglamorous question that decides the build.

On n8n, the answer is: you do. The editor is the only place where you can replay a failed execution, and there is no programmatic way to detect "Twinfield's refresh-token rotated, pause everything depending on it." You write that as a separate cron, and when it misfires, the workflow editor is your incident-response tool. We tried this for three weeks and burned through fourteen developer-hours of weekend on-call before we moved.

On Make, the answer is: you, plus Make's status page. The 02:00 problem manifests as a hung scenario, not a failed one — incremental retry keeps trying. Finance sees it Monday morning when they cannot reconcile.

On Temporal, the answer is: nobody, most of the time. The tokenSentinel workflow is itself durable, it runs every fifteen minutes, it owns the refresh, and it signals the invoice workflows to continue. The one time we did get paged in nine months of operation was when Twinfield changed the shape of the refresh response — a code-level problem, not a runtime problem, and Temporal had every failing workflow paused safely waiting for us to ship the fix.

What we shipped

We built this on Temporal Cloud for the first six months and migrated to self-hosted Temporal on the client's Hetzner cluster once volume was stable. The factuur-agent now processes between 4,200 and 5,100 invoices a week, with a 96.4% straight-through rate and a 1.8-day median time-to-Twinfield. The seven-year archive is a single S3 bucket of gzipped protobuf, billed in cents per month.

When we built this AI-agent stack for the Apeldoorn factory, the thing we kept running into was that the choice of durable-execution layer mattered more than the choice of LLM. We ended up rewriting the prompt twice and the runtime never.

If you are scoping a similar build, the five-minute audit is this. List the steps in your pipeline. Multiply by your monthly volume. Then ask your candidate runtime two questions: how long does it keep the execution trace, and is the code that ran still the code you can re-run. If either answer is unsatisfying, stop costing and pick a different tool.
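The audit fits in a few lines. This is a sketch with hypothetical names; the 90-day value in the usage example is the upper end of the hosted log windows mentioned earlier, and the seven-year requirement is the bewaarplicht.

```typescript
// Hypothetical description of a candidate runtime; field names are assumptions.
interface RuntimeCandidate {
  name: string
  traceRetentionDays: number // how long the runtime keeps the execution trace
  pinsCodeVersion: boolean // is the code that ran still the code you can re-run?
}

// Step executions per month: pipeline steps multiplied by monthly volume.
function monthlyStepExecutions(stepsPerItem: number, itemsPerMonth: number): number {
  return stepsPerItem * itemsPerMonth
}

// Returns the list of unsatisfying answers; an empty list means keep costing.
function auditRuntime(candidate: RuntimeCandidate, requiredRetentionYears: number): string[] {
  const problems: string[] = []
  const requiredDays = requiredRetentionYears * 365
  if (candidate.traceRetentionDays < requiredDays) {
    problems.push(`${candidate.name}: trace kept ${candidate.traceRetentionDays} days, need ${requiredDays}`)
  }
  if (!candidate.pinsCodeVersion) {
    problems.push(`${candidate.name}: cannot re-run the code version that originally ran`)
  }
  return problems
}

const hostedExample: RuntimeCandidate = { name: 'hosted-runtime', traceRetentionDays: 90, pinsCodeVersion: false }
console.log(monthlyStepExecutions(11, 20000)) // prints 220000
console.log(auditRuntime(hostedExample, 7).length) // prints 2
```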

Key takeaway

The cheapest per-invoice runtime is rarely the cheapest total-cost runtime once you price seven-year retention and the on-call hours of OAuth rotations.

FAQ

Is Temporal overkill for a 20-person company?

For a single low-volume workflow, yes. For one with seven-year audit retention or external APIs that rotate credentials, the runtime pays for itself inside a quarter on saved on-call hours alone.

Can n8n satisfy the Dutch belastingdienst bewaarplicht on its own?

Not directly. n8n's execution database is operational, not archival. You would need to mirror execution data into long-term storage and version-pin workflows yourself before a tax audit lands.

Why not Zapier for this workload?

At 20,000 invoices a month the per-task pricing on Zapier comes out a multiple of Make's number, and the same retention and OAuth-rotation gaps apply. The constraints that ruled out Make also rule out Zapier.

What did Temporal Cloud cost at this scale?

Under $40 a month at 20,000 invoices, billed by Action. Self-hosted on Hetzner came in around €180 a month including the Postgres tier, monitoring, and backups.

Do we need to write TypeScript to use Temporal?

No. Temporal supports Go, Java, Python, .NET, and PHP SDKs in addition to TypeScript. We picked TypeScript because the rest of the agent stack was already in it.
