
Process automation

Self-hosted n8n: how a stale sentinel ate 11,400 events

On Monday morning the inbox had 47 angry tickets. The candidate pipeline was empty. The queue dashboard read healthy. The 11,400 missing events were already gone.

Jacob Molkenboer · Founder · A Brand New Company · 23 Dec 2024 · 9 min

The first ticket came in at 08:14 on Monday. A recruiter at a hospital client in Eindhoven could not see the seven applications that had landed on her ATS over the weekend. By 09:30 there were forty-seven tickets. The candidate pipeline view was empty. The n8n queue dashboard read healthy. The webhook receiver was returning 200s for every test payload the support engineer fired at it.

The events were not delayed. They were gone. Eleven thousand four hundred of them, across seventy-two hours.

This is what happened inside a 33-person HR-tech SaaS in Apeldoorn between Friday evening and Monday morning, and the post-mortem we wrote with their lead engineer on the Tuesday after.

The stack on Friday

The client runs a workforce platform that sits between job boards, ATS systems, and payroll. Most of the inbound work is webhooks: a new application from Greenhouse, a status change from BambooHR, a contract event from Visma, a Slack notification from internal recruiters. Roughly 4,000 to 6,000 events on a weekday, less on weekends, but never zero. Hospitals and warehouses post jobs every day.

They run self-hosted n8n in queue mode: one main instance handling webhooks and the editor, three worker instances pulling jobs from a Redis-backed BullMQ. Postgres holds the workflow definitions and the execution history. Redis runs as a three-node Sentinel setup on the same Hetzner cluster, configured after a single-node Redis took the platform down for ninety minutes in late 2024.

It is a sensible architecture. It is also the same architecture you will find inside roughly two hundred mid-sized European SaaS teams that picked n8n over Zapier in the last eighteen months. Nothing here is exotic.

The failure on Friday at 19:42

The Redis master node hit a kernel OOM and got killed. The Sentinels promoted a replica within four seconds. So far, healthy. The main n8n instance noticed, opened a new connection to the promoted master inside thirty seconds, and started accepting webhooks again. The Sentinel logs were clean. redis-cli -p 26379 sentinel masters showed the new master with the correct flags.

The workers did not notice.

Two of the three workers had been started with a hardcoded Redis host in their env file, not a Sentinel client config. The third worker had the Sentinel block but cached the original master IP in its connection pool and never refreshed the sentinel topology. When the master died, all three workers held connections that timed out silently every few seconds, retried against an IP that no longer answered, and logged nothing above WARN.

n8n's main instance kept enqueueing jobs. BullMQ does not care that no worker is pulling. The waiting list grew. Nothing dropped. Yet.

The retry queue that ate the events

The actual loss came from a setting we have seen misread on at least eight self-hosted n8n installs this year. In their n8n docker-compose, the workers had been started with a pruning block that looked reasonable on paper:

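# Prune stored execution data from the n8n database
EXECUTIONS_DATA_PRUNE=true
# Maximum execution age, in hours
EXECUTIONS_DATA_MAX_AGE=24
# Maximum number of executions to keep
EXECUTIONS_DATA_PRUNE_MAX_COUNT=10000
# Redis timeout threshold, in milliseconds
QUEUE_BULL_REDIS_TIMEOUT_THRESHOLD=10000
# Queue recovery polling interval, in seconds
QUEUE_RECOVERY_INTERVAL=60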

The pruning settings here are not the problem in themselves. The problem is what happens when a job has been enqueued, no worker takes it for longer than the BullMQ stalled-job interval, and the queue's retry policy moves it to the failed set. Their retry config was three attempts with exponential backoff. After three failed pickup attempts, jobs went to the failed queue. The failed queue had a TTL of one hour and a max-count of two thousand. Anything past either ceiling was pruned out by the BullMQ janitor.

Over the weekend the workers logged 38,000 stalled-job errors. Eleven thousand four hundred jobs aged past the failed-queue TTL and were silently deleted. There is no warning sound when a janitor runs.
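To make the mechanism concrete, here is a toy model in Python. It is not BullMQ code and calls no BullMQ API. FailedSet, sweep, simulate and the numbers fed to them are hypothetical names and values chosen for illustration. The sketch assumes only what is described above: a failed set with an age ceiling and a count ceiling, and a periodic sweep that discards whatever is past either one.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FailedSet:
    """Toy model of a failed-job set with an age ceiling and a count ceiling."""
    ttl_seconds: int
    max_count: int
    jobs: deque = field(default_factory=deque)  # (job_id, failed_at), oldest first
    dropped: int = 0

    def add(self, job_id: str, failed_at: int) -> None:
        self.jobs.append((job_id, failed_at))

    def sweep(self, now: int) -> int:
        """Discard jobs older than the TTL, then trim to max_count.

        Returns the number of jobs discarded by this sweep.
        """
        before = len(self.jobs)
        while self.jobs and now - self.jobs[0][1] > self.ttl_seconds:
            self.jobs.popleft()
        while len(self.jobs) > self.max_count:
            self.jobs.popleft()
        removed = before - len(self.jobs)
        self.dropped += removed
        return removed


def simulate(hours: int, failures_per_hour: int,
             ttl_seconds: int, max_count: int) -> FailedSet:
    """Feed a steady stream of failed jobs with no worker recovery,
    sweeping once per simulated hour."""
    failed = FailedSet(ttl_seconds=ttl_seconds, max_count=max_count)
    for hour in range(hours):
        now = hour * 3600
        for i in range(failures_per_hour):
            failed.add(f"job-{hour}-{i}", failed_at=now)
        failed.sweep(now)
    return failed
```

In this model nothing is ever retried, because no worker comes back. Raising the TTL or the max-count only postpones the first drop; the dropped counter still climbs for as long as the outage lasts.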

Warning

If your retry queue has a finite TTL or a finite max-count, you do not have a retry queue. You have a delayed-loss queue. Move dead-lettered jobs out of Redis before the janitor sweeps them, or accept that "eventually consistent" includes "eventually gone".

Why nobody saw it

The client had monitoring. The main n8n instance was healthy. Its /healthz returned 200. Webhook endpoints returned 200. The Postgres execution-history table kept being written to, because a handful of cron workflows ran on the main instance and never touched the worker queue. The Grafana board was green from corner to corner.

What they did not have was a check on queue depth versus worker throughput. BullMQ exposes both. The dashboard they used charted only the active and waiting counts on a one-minute average. Waiting peaked at 4,800 jobs late Saturday, recovered briefly when the janitor swept the failed queue, then climbed again. The chart looked like a gentle sawtooth. To a human eye it looked like a workload pattern, not a leak.

This is the part we want every operations lead to internalise. Healthy is not a status flag. Healthy is a ratio between what came in and what got processed. If your dashboard cannot answer "did each webhook produce its intended downstream effect" then your dashboard is decoration.

The fix on Monday

Three changes shipped before lunch.

First, the workers were restarted with a proper Sentinel client config. n8n reads Redis settings from a single env block when you point it at Sentinel correctly:

QUEUE_BULL_REDIS_HOST=
QUEUE_BULL_REDIS_PORT=
QUEUE_BULL_REDIS_SENTINEL_HOSTS=10.0.0.11:26379,10.0.0.12:26379,10.0.0.13:26379
QUEUE_BULL_REDIS_SENTINEL_NAME=mymaster
QUEUE_BULL_REDIS_SENTINEL_PASSWORD=${REDIS_SENTINEL_PASSWORD}

Note the empty HOST and PORT. If you leave those set, ioredis silently uses them and ignores the sentinel list. That is the trapdoor that bit two of their three workers, because both env files had been copied from the pre-Sentinel single-node era and never cleaned up.
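A leftover HOST or PORT is easy to catch before a worker ever starts. The sketch below is a small pre-deploy check in Python, not part of n8n. It takes the variable names from the block above and simply encodes the rule as we described it (direct host settings must be empty when a Sentinel list is present). The function names parse_env and find_sentinel_conflicts are hypothetical, and the parser only handles plain KEY=VALUE lines.

```python
SENTINEL_KEY = "QUEUE_BULL_REDIS_SENTINEL_HOSTS"
DIRECT_KEYS = ("QUEUE_BULL_REDIS_HOST", "QUEUE_BULL_REDIS_PORT")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def find_sentinel_conflicts(text: str) -> list[str]:
    """Return human-readable problems found in a worker env file."""
    env = parse_env(text)
    if not env.get(SENTINEL_KEY):
        # No Sentinel list at all: the worker is pinned to a single host.
        return [f"{SENTINEL_KEY} is empty or missing"]
    problems = []
    for key in DIRECT_KEYS:
        if env.get(key):
            problems.append(f"{key}={env[key]} is set alongside {SENTINEL_KEY}")
    return problems
```

Run it against every worker env file in CI and fail the build on a non-empty result. It would have flagged both copied files from the single-node era.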

Second, the failed-queue TTL was removed and the max-count raised to fifty thousand, with a separate cron pushing aged dead-letters into a Postgres n8n_dead_letters table for offline replay. The replay script is twenty lines of Node and runs nightly. Redis is now the in-flight retry layer. Postgres is the durable graveyard.
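The client's script is Node and is not reproduced here. What follows is a rough Python sketch of the same idea, with sqlite3 standing in for Postgres so it runs anywhere. Only the table name n8n_dead_letters comes from the setup above. The columns, the function names and the shape of the job dictionaries are assumptions, and reading the failed jobs out of Redis is deliberately left out: the function takes them as a plain list.

```python
import json
import sqlite3


def ensure_table(conn: sqlite3.Connection) -> None:
    """Create the dead-letter table if it does not exist (hypothetical columns)."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS n8n_dead_letters (
            job_id    TEXT PRIMARY KEY,
            failed_at INTEGER NOT NULL,
            payload   TEXT NOT NULL,
            replayed  INTEGER NOT NULL DEFAULT 0
        )
        """
    )


def archive_dead_letters(conn: sqlite3.Connection, jobs: list[dict],
                         now: int, min_age_seconds: int) -> list[str]:
    """Copy failed jobs older than min_age_seconds into durable storage.

    Returns the IDs that are safe to remove from the queue afterwards.
    Re-running is harmless: an already archived job_id is skipped.
    """
    archived = []
    with conn:  # one transaction: commit on success, roll back on error
        for job in jobs:
            if now - job["failed_at"] < min_age_seconds:
                continue
            # SQLite syntax; on Postgres this would be ON CONFLICT DO NOTHING.
            conn.execute(
                "INSERT OR IGNORE INTO n8n_dead_letters (job_id, failed_at, payload) "
                "VALUES (?, ?, ?)",
                (job["id"], job["failed_at"], json.dumps(job["payload"], sort_keys=True)),
            )
            archived.append(job["id"])
    return archived
```

The order matters: write to the durable table first, delete from Redis second, and only delete the IDs the function returned. If the cron dies halfway, the worst case is a job stored twice in two places, never a job stored nowhere.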

Third, a deliberately stupid alert was wired into Grafana: "inbound webhooks in the last fifteen minutes" minus "completed executions in the last fifteen minutes". If that delta exceeds 200 for ten minutes, PagerDuty wakes someone up. The alert does not care why. It just notices the gap.
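The rule itself fits in a few lines. The sketch below expresses it in Python rather than as a Grafana alert definition. It assumes one sample per minute, each carrying the two trailing fifteen-minute counts. Sample and should_page are hypothetical names, and the thresholds default to the 200 and ten minutes mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    minute: int         # minute index of the sample
    inbound_15m: int    # inbound webhooks in the trailing 15 minutes
    completed_15m: int  # completed executions in the trailing 15 minutes


def should_page(samples: list[Sample], threshold: int = 200,
                sustain_minutes: int = 10) -> bool:
    """Return True when the inbound/completed gap has exceeded the threshold
    in every one of the most recent sustain_minutes samples.

    Assumes samples are ordered oldest to newest, one per minute.
    """
    if len(samples) < sustain_minutes:
        return False
    window = samples[-sustain_minutes:]
    return all(s.inbound_15m - s.completed_15m > threshold for s in window)
```

Requiring every sample in the window to breach the threshold keeps a single burst of slow executions from paging anyone, while a stalled worker pool trips it ten minutes in.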

Recovering the 11,400

Greenhouse and BambooHR both keep a webhook delivery log on their side with a thirty-day window. We wrote a replay job that hit their delivery-log APIs, pulled every event with a Friday-to-Sunday timestamp, hashed each payload, compared the hashes against what was already in Postgres, and re-fed the missing ones into the n8n inbound endpoint. About 10,800 events came back that way.
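The comparison step is the part worth sketching, since re-feeding an event that already landed would create duplicates downstream. The Python below is an illustration only: the vendor delivery-log calls are not shown, and the choice of SHA-256 over key-sorted JSON is our assumption for the example, as are the names payload_fingerprint and missing_events.

```python
import hashlib
import json


def payload_fingerprint(payload: dict) -> str:
    """Hash a payload in a way that ignores key order and whitespace."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def missing_events(delivered: list[dict], stored_fingerprints: set[str]) -> list[dict]:
    """Return delivered events whose fingerprint is not already stored.

    Also drops repeats inside the delivered list itself, since a vendor
    may log several delivery attempts for the same event.
    """
    seen = set(stored_fingerprints)
    missing = []
    for event in delivered:
        fingerprint = payload_fingerprint(event)
        if fingerprint not in seen:
            seen.add(fingerprint)
            missing.append(event)
    return missing
```

Canonicalising before hashing matters because the same event can come back from a delivery log with its keys in a different order than the copy stored at ingest time.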

Visma does not keep a delivery log. Roughly 600 events from Visma payroll were unrecoverable. The client phoned the affected employers on Tuesday and resolved each one by hand. It cost the engineering lead two days. It cost the support team a week. No customer left, which surprised us. The lesson from that part: post a real post-mortem within twenty-four hours, name what you broke, name what you changed, and most B2B customers will respect you more than before.

The architectural takeaway

If your business depends on webhook-driven automation, three things have to be true.

Your queue must survive infrastructure failover. Test it. Kill the Redis master in staging on a Wednesday afternoon and watch the workers reconnect. If they do not, your workers do not have a Sentinel client config. They have a fragile pointer wearing a Sentinel costume.

Your dead-letter queue must live somewhere durable. Redis is fine for in-flight retries. It is not where you want a job to spend Saturday night. Postgres, S3, anything that does not have a janitor running on a timer.

Your monitoring must measure the gap. Not "is the service up", but "did each event that came in produce the work it was supposed to". For inbound webhooks that is countable. For more complex flows it is a ratio of source-of-truth records against expected-outcome records. Build the diff and alert on it.

None of this is novel. The Redis Sentinel documentation warns plainly that "the client library has to be Sentinel-aware". The n8n queue-mode docs spell out the env-variable trap. The BullMQ docs explain the failed-queue lifecycle in detail. The problem is that mid-sized teams running self-hosted automation usually inherit the stack from one engineer who left, and the next engineer trusts that the previous one read the footnotes.

What we keep telling clients

When we rebuilt the queue layer for this client, the thing that surprised us was how cheap the durable dead-letter was: one Postgres table, one replay script, twelve hours of work. The other twenty hours went into convincing three different teams that "healthy equals green" was the lie that had cost them the weekend. If you run a similar shape of process automation at scale, our advice is the same we gave them: build the gap alert first, then go fix everything else.

The five-minute audit for your own stack: open your queue dashboard, write down the number of jobs that entered the system in the last hour, then write down the number of executions that completed in the same hour. If the second number is smaller and you cannot explain where the difference went, you have the same problem they had on Friday at 19:42.

Key takeaway

If your retry queue has a finite TTL or a finite max-count, you do not have a retry queue. You have a delayed-loss queue.

FAQ

Does this failure mode only affect n8n?

No. The same pattern hits any queue layer built on Redis with finite-TTL failed-job retention: BullMQ, Sidekiq Pro, Celery on Redis, RQ. The trigger is always the same: workers stop pulling, retries exhaust, janitor sweeps.

How do I test Sentinel failover safely?

In staging, kill the current master with redis-cli SHUTDOWN NOSAVE and watch the worker logs to confirm that the workers reconnect within seconds. Repeat weekly until it stops being scary. Production failover should never be the first time you see this path.

Should we just move off self-hosted n8n to a hosted runner?

Not necessarily. Self-hosted is fine if you treat the queue and dead-letter layer like a database, not like a cache. Most teams treat it like a cache. That is what costs them the weekend, not the choice of n8n.

How big does a team need to be to run n8n at scale?

One competent platform engineer and a working monitoring habit. We have seen three-person ops teams run it well and twenty-person teams run it badly. Size correlates with nothing; ownership correlates with everything.

process automation · automation · workflow · integrations · architecture · operations
