Incident-walkthrough

n8n retry storm: anatomy of a €1,840 OpenAI burn

The dashboard froze at 02:14 on a Tuesday. The OpenAI bill kept climbing. Six hours and €1,840 in tokens later, the catalog-enrichment workflow was still retrying.

Jacob Molkenboer · Founder · A Brand New Company · 6 Jun 2026 · 8 min

The dashboard froze at 02:14

A Den Haag e-commerce team runs nightly catalog enrichment on around 12,000 SKUs. The workflow lives in n8n. It pulls new and changed products from Postgres, sends each one to OpenAI for a description rewrite and a category guess, writes the result back, and posts a heartbeat to a Slack-fed dashboard every 500 items.

At 02:14 on a Tuesday, the dashboard tile stopped updating. The last Slack heartbeat read batch 4 of 24. Nobody saw it. The operations lead opened her laptop at 08:30, refreshed the OpenAI billing page out of habit, and saw €1,840 spent since midnight. Their normal nightly burn is €11.

Six hours of quiet, paid retries. No alert fired. No email landed. The workflow was, in a sense, working perfectly: every node returned success, eventually.

This is the post-mortem.

The workflow that bit

The pipeline was nine nodes. The shape mattered:

Cron trigger, every night at 02:00.
Postgres node, pulls changed SKUs since the last run.
Split In Batches, size 50.
OpenAI Chat node, gpt-4.1, generates a description and a category guess.
JSON parse and validate.
Postgres update.
Heartbeat webhook on every tenth batch, which is every 500 items.
End of loop.
Error workflow trigger that, lovely detail, calls OpenAI again to summarise the failure for Slack.

The OpenAI node was configured with Retry On Fail: enabled, Max Tries: 5, Wait Between Tries: 1000ms, a fixed wait with no backoff between attempts. The error workflow was an n8n community template the previous developer had pasted in unchanged. Bolted onto step 9 was an HTTP node that POSTed back to the same n8n instance to restart the workflow at the last successful batch index, in the name of "resilience".

You can already see two of the three landmines.

Anatomy of the retry storm

Start with a single 429. OpenAI returns rate-limit errors for two reasons: tokens-per-minute and requests-per-minute. The team had a Tier 2 account, comfortable for batched calls of around 2,400 output tokens each, but breakable. Three batches of 50 SKUs back-to-back, and they crossed the per-minute window. You can verify the per-tier limits in the OpenAI rate-limit guide.

Here is what n8n does on a 429 with the settings the team had:

The node fails.
Retry On Fail kicks in. n8n waits 1 second and tries again, up to five attempts in total.
The Split In Batches node does not catch it; the error propagates to the workflow level.
The Error Trigger workflow fires. That workflow calls OpenAI again to summarise the failure.
The HTTP self-trigger fires. A fresh workflow run starts at the last successful batch index.
The fresh run hits the still-saturated quota. Repeat.
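The compounding is easier to see as two nested loops. The sketch below is a toy model, not n8n internals: `countCalls` and its parameters are hypothetical names, and it assumes every run fails and triggers exactly one summary call and one restart.

```javascript
// Toy model of the storm: node-level tries nested inside workflow-level restarts.
// All names are illustrative; nothing here is an n8n API.
function countCalls({ maxTries, restarts, summaryCallsPerFailure }) {
  let enrichmentCalls = 0;
  let summaryCalls = 0;
  for (let run = 0; run < restarts; run++) {
    // Retry On Fail: the node tries up to maxTries times before it throws.
    for (let attempt = 1; attempt <= maxTries; attempt++) {
      enrichmentCalls++;
    }
    // The error workflow then calls the model again to summarise the failure.
    summaryCalls += summaryCallsPerFailure;
    // The HTTP self-trigger starts the next run; the loop has no exit of its own.
  }
  return { enrichmentCalls, summaryCalls, total: enrichmentCalls + summaryCalls };
}

const storm = countCalls({ maxTries: 5, restarts: 30, summaryCallsPerFailure: 1 });
console.log(storm); // { enrichmentCalls: 150, summaryCalls: 30, total: 180 }
```

That is the count for a single stuck item. Nothing in the model bounds `restarts`; in the incident the only bound was a human opening a laptop.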

There is a crucial billing detail buried in here. A request rejected outright with a 429 is not billed; the spend comes from calls that were accepted and then abandoned. OpenAI charges for output tokens the model generated, even if your client times out or disconnects mid-stream. The HTTP timeout in their workflow was 60 seconds. The model started, n8n killed the socket, the model finished generating anyway. n8n logged a failure. OpenAI logged the tokens. (n8n's own error-handling documentation warns about not double-handling errors, but says little about the cost of doing so against a paid endpoint.)

Token cost per cycle, roughly:

50 items per batch
  x ~2,400 output tokens per call
  x 5 node-level retries
  x ~30 self-restart cycles before the morning
= ~18 million billed output tokens

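At gpt-4.1 output pricing of around €0.008 per 1k tokens, 18 million output tokens come to roughly €144, well short of the €1,840 on the bill. The estimate counts only output tokens on the enrichment calls; it leaves out input tokens and the error workflow's own summarisation calls, so treat it as a floor rather than a reconciliation.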
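The multiplication above is easy to re-run with your own numbers. A minimal sketch follows; the function names are illustrative, and the inputs are the figures from the estimate.

```javascript
// Worst-case billed output tokens for a retrying batch loop.
// Inputs mirror the estimate above; the function names are illustrative.
function worstCaseOutputTokens({ itemsPerBatch, tokensPerCall, triesPerCall, restartCycles }) {
  return itemsPerBatch * tokensPerCall * triesPerCall * restartCycles;
}

// Cost helper: price is per 1,000 tokens, in whatever currency you pass in.
function costOf(tokens, pricePer1k) {
  return (tokens / 1000) * pricePer1k;
}

const tokens = worstCaseOutputTokens({
  itemsPerBatch: 50,
  tokensPerCall: 2400,
  triesPerCall: 5,
  restartCycles: 30,
});
console.log(tokens); // 18000000
```

Pass the result to `costOf` with your model's current output price, then repeat the exercise for input tokens, which this estimate ignores.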

The dashboard's blind spot

The Slack heartbeat fired from inside the Split In Batches loop, after a successful batch write. The first three real batches went through fine, before the workflow hit the TPM ceiling. After that, every dashboard refresh showed the same "batch 4 of 24" heartbeat from an earlier self-restart, because the workflow kept restarting at batch 4.

The dashboard tile said "batch 4 of 24, last update 02:14". It never updated, because the workflow was looping on batch 4. The on-call dashboard had no second signal: no per-minute throughput, no spend tracker, no "is the workflow currently running?" widget. Slack stayed silent, and silence read as all was well.

This is the kind of thing that fuels engineer scepticism about AI infrastructure. A recurring complaint in this week's front-page Ask HN thread about anti-AI sentiment was that the cost surface is opaque. This is what they mean. The workflow did not fail loudly. It failed expensively and silently.

The six-hour timeline

02:00:00  Cron fires. Workflow run #1 starts.
02:01:14  Batches 1, 2, 3 complete. Heartbeat: 3 of 24.
02:01:58  Batch 4 starts. TPM ceiling hit on item 27.
02:02:03  OpenAI node retry 1/5. 429.
02:02:08  Retry 5/5. Node throws.
02:02:09  Error branch fires. HTTP self-trigger.
02:02:09  Workflow run #2 begins at batch 4.
02:02:14  Workflow #2 hits same 429.
...       (this repeats, every ~12 minutes)
05:47:30  Workflow run #~28 hits same 429.
08:30:12  Operations lead opens billing dashboard. €1,840.
08:30:40  Workflow killed manually via n8n UI.

There was no single point of catastrophic failure. There were thirty small failures, each costing about sixty euros.

The fix, in three changes

We did three things, in order of importance.

1. Cap the spend at the source

OpenAI lets you set hard usage limits per project. The team had not. We split their workflow's API key into a dedicated project, set a hard limit of €50 per day, and let the model start returning 403s when the limit hits. A 403 is a real, loud failure. The controls are documented in OpenAI's production best practices guide; they take ninety seconds to configure and they would have ended this incident at €50 instead of €1,840.

2. Delete the self-restart loop

The "resilient" HTTP self-trigger went in the bin. We replaced it with a queue. Failed items get written to a failed_enrichment_queue Postgres table with their last error and a retry count. A separate workflow drains the queue at 06:00 with a fresh quota window. No live loop, no recursion, no surprise at sunrise.
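The queue logic is small enough to sketch in memory. Here a `Map` stands in for the failed_enrichment_queue table. The field names (`sku`, `lastError`, `retryCount`) and the `maxRetries` cut-off are assumptions for illustration, not the team's actual schema.

```javascript
// In-memory stand-in for the failed_enrichment_queue table.
// Field names and the maxRetries cut-off are assumptions for illustration.
function recordFailure(queue, sku, errorMessage) {
  const existing = queue.get(sku);
  queue.set(sku, {
    sku,
    lastError: errorMessage,
    retryCount: existing ? existing.retryCount + 1 : 0,
  });
}

// Items the scheduled drain should attempt; anything past maxRetries waits for a human.
function selectDrainable(queue, maxRetries) {
  return [...queue.values()].filter(row => row.retryCount < maxRetries);
}

const queue = new Map();
recordFailure(queue, 'SKU-1', '429 rate limit');
recordFailure(queue, 'SKU-2', '429 rate limit');
recordFailure(queue, 'SKU-2', 'timeout'); // second failure for the same SKU
const drainable = selectDrainable(queue, 1).map(row => row.sku);
console.log(drainable); // [ 'SKU-1' ]
```

The retry count lives in data rather than in control flow, so a bad item can only be retried when the drain workflow chooses to pick it up.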

3. Respect the 429

We turned Retry On Fail off for the OpenAI node and added a Code node that reads the Retry-After header and waits the requested amount, with jitter:

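// Code node, mode "Run Once for All Items", placed after the OpenAI node.
// Assumes the OpenAI node continues on error, so a failed call arrives here
// as an item carrying an `error` object with `status` and `headers`.
// Honours Retry-After if present, otherwise falls back to a capped delay.
const items = $input.all();
let waitMs = 0;
items.forEach((item, index) => {
  const err = item.json.error;
  if (err?.status !== 429) return;
  const retryAfter = Number(err.headers?.['retry-after'] ?? 0);
  const base = retryAfter > 0
    ? retryAfter * 1000
    : Math.min(2 ** (index % 6) * 1000, 30_000);
  // Deterministic jitter: the same input always yields the same delay.
  const jitter = (index * 9301 + 49297) % 1024;
  waitMs = Math.max(waitMs, base + jitter);
});
if (waitMs > 0) {
  await new Promise(r => setTimeout(r, waitMs));
}
return items;

Note the deterministic jitter derived from the item's index rather than Math.random(). n8n executions stay inspectable when the same input produces the same output, which is what you want when a workflow run becomes evidence. The fallback delay is keyed on item position, not on attempt count, so it spreads calls out rather than backing off in the strict sense.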
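One detail the snippet above glosses over: HTTP allows Retry-After to be either a number of seconds or an HTTP-date. The incident did not establish whether OpenAI ever sends the date form, so the helper below is a defensive sketch; `parseRetryAfterMs` is a hypothetical name.

```javascript
// Parse a Retry-After header value into milliseconds to wait.
// Accepts delay-seconds ("20") or an HTTP-date; returns 0 when absent or unparseable.
function parseRetryAfterMs(value, nowMs = Date.now()) {
  if (value === undefined || value === null || value === '') return 0;
  const text = String(value).trim();
  if (/^\d+$/.test(text)) return Number(text) * 1000;
  const dateMs = Date.parse(text);
  if (Number.isNaN(dateMs)) return 0;
  // A date in the past means "retry now", never a negative wait.
  return Math.max(0, dateMs - nowMs);
}

const now = Date.parse('Wed, 21 Oct 2015 07:28:00 GMT');
console.log(parseRetryAfterMs('20', now));                            // 20000
console.log(parseRetryAfterMs('Wed, 21 Oct 2015 07:28:30 GMT', now)); // 30000
console.log(parseRetryAfterMs('soon', now));                          // 0
```

Returning 0 for an unreadable header lets the caller fall through to its own capped delay instead of waiting on a NaN.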

Takeaway

The most expensive AI failures are not the ones that crash. They are the ones that keep retrying, quietly, on a budget you never capped.

The monitoring layer we added

A workflow that fails silently is worse than one that fails loudly. Three checks went in:

A token-spend heartbeat. Every fifteen minutes, a small workflow queries OpenAI's usage endpoint and posts to Slack if the day's spend exceeds three times the rolling fourteen-day p95.
A liveness probe on the dashboard. The tile now shows "last heartbeat at HH:MM" with a colour that turns red after ten minutes of silence, instead of cheerfully showing the stale value.
A concurrency alarm. If the same workflow has more than three concurrent executions, PagerDuty pings the on-call.
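Each of the three checks reduces to a one-line predicate. The sketch below uses the thresholds from the list. The function names, the nearest-rank percentile and the sample spend history are illustrative assumptions, not the team's data.

```javascript
// Three pure checks mirroring the list above. Thresholds come from the text;
// function names and input shapes are illustrative.

// Nearest-rank percentile over a list of daily spend figures.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

function spendIsAnomalous(todaySpend, last14Days) {
  return todaySpend > 3 * percentile(last14Days, 95);
}

// Fails closed: silence longer than maxSilenceMs turns the tile red.
function tileColour(lastHeartbeatMs, nowMs, maxSilenceMs = 10 * 60 * 1000) {
  return nowMs - lastHeartbeatMs > maxSilenceMs ? 'red' : 'green';
}

function tooManyExecutions(runningCount, limit = 3) {
  return runningCount > limit;
}

// Made-up history hovering around a normal nightly burn.
const history = [11, 10, 12, 11, 9, 11, 13, 10, 11, 12, 11, 10, 11, 12];
console.log(spendIsAnomalous(50, history));  // true
console.log(tileColour(0, 11 * 60 * 1000));  // red
console.log(tooManyExecutions(3));           // false
```

Keeping the checks as pure functions means the alert thresholds can be unit-tested without touching Slack, PagerDuty or the billing API.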

None of this is exotic. The team had trusted "it worked yesterday" as a monitoring strategy.

A five-minute audit you can run today

Before lunch:

Open the OpenAI project dashboard. Are hard usage limits set on every project? If you never changed the default, the answer is no.
Open every n8n workflow that calls a paid API. Look at every node with Retry On Fail enabled. Multiply max tries × items per run × cost per call. If that worst case scares you, change the configuration.
Search your workflows for self-trigger patterns: HTTP nodes that POST to the same n8n instance, or Error Trigger workflows that restart the failed run. These are loops waiting to happen.
Confirm every long-running workflow has a heartbeat that fails closed. The dashboard turns red on absence, never stays green on the last good value.
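The second and third checks can be partly automated against an exported workflow file. This is a sketch under stated assumptions. It assumes the export is JSON with a `nodes` array, that retry settings sit on each node as `retryOnFail` and `maxTries`, and that HTTP nodes use the type `n8n-nodes-base.httpRequest` with a `parameters.url` field. Verify those against your own export. The sample workflow, including the `example.openAi` type, is invented.

```javascript
// Scan an exported n8n workflow (parsed JSON) for the two risky patterns.
// Field names are assumptions about the export shape; check your own file.
function auditWorkflow(workflow, instanceHost) {
  const nodes = workflow.nodes ?? [];
  const retrying = nodes
    .filter(node => node.retryOnFail === true)
    .map(node => ({ name: node.name, maxTries: node.maxTries ?? null }));
  const selfTriggers = nodes
    .filter(node => node.type === 'n8n-nodes-base.httpRequest')
    .filter(node => String(node.parameters?.url ?? '').includes(instanceHost))
    .map(node => node.name);
  return { retrying, selfTriggers };
}

// Invented sample; node names and the first type are placeholders.
const sample = {
  nodes: [
    { name: 'OpenAI Chat', type: 'example.openAi', retryOnFail: true, maxTries: 5 },
    { name: 'Restart run', type: 'n8n-nodes-base.httpRequest',
      parameters: { url: 'https://n8n.example.com/webhook/restart' } },
    { name: 'Postgres update', type: 'n8n-nodes-base.postgres', parameters: {} },
  ],
};
const report = auditWorkflow(sample, 'n8n.example.com');
console.log(report);
```

A plain substring match will miss URLs built from expressions or environment variables, so treat an empty `selfTriggers` list as a prompt to look by hand, not as a clean bill of health.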

When we rebuilt the catalog-enrichment pipeline for this Den Haag e-commerce team as one of our AI agent engagements, the lesson we kept reusing was that n8n is a fine conductor and a dangerous executioner. We moved the model calls behind a small queue worker that owned the rate limit, and left n8n to do what it does well: orchestrate the rest. The shape of the pipeline did not change. The blast radius of a single bad minute did.

Tonight, before you close your laptop: set a hard spend limit on every OpenAI project key you own. It takes ninety seconds. With the €50 cap in place, it would have saved this team roughly €1,790.

Key takeaway

Cap your OpenAI spend per project before you trust any workflow. Hard limits turn silent six-hour burns into one obvious 403.

FAQ

Why does OpenAI charge for tokens when my client times out?

Output tokens are billed for what the model generated, not what your client received. If you kill a long completion with a 60-second timeout, you still pay for whatever the model produced before the socket closed.

Is n8n's default Retry On Fail safe for paid APIs?

Not without thought. A setting of 5 tries with a fixed 1-second wait multiplies cost on transient errors and ignores Retry-After headers. Turn it off and handle retries explicitly via a queue, or read the header and back off properly.

How do I prevent a workflow from restarting itself in a loop?

Never use an HTTP node to self-trigger the same workflow on error. Write failed items to a queue table and drain it from a separate workflow on a fresh schedule. Live recursion against a rate-limited API is how silent burns happen.

What spend cap should I set on an OpenAI project key?

Set it at roughly three times your normal daily burn, never the default unlimited. The point is that a runaway workflow hits a 403 within hours, not a five-figure invoice at the end of the month.
