
Process automation

Process automation gone wrong: the €4,100 polling bug

A 19-person logistics SaaS in Utrecht almost lost a year of margin to a 200ms poll loop. The agent did its job. The cloud bill did the damage.

Jacob Molkenboer · Founder · A Brand New Company · 30 Oct 2024 · 9 min

The bill that shouldn't have happened

The CFO of a 19-person logistics SaaS in Utrecht forwarded us a single screenshot at 16:47 on a Thursday. Their managed Postgres + Redis bill on their cloud provider had gone from €380 in February to €4,100 in May. No new customers. No new features in production. No infrastructure migration. Just three months of compounding cost from a single process automation worker they could not explain.

The ops lead had a theory: "Something we built is hitting the database too hard." She was right, but only by accident. The cost driver wasn't the database. It was a Redis queue, and the offending process was something her own team had asked for back in October.

What the agent was supposed to do

The company runs a freight-routing platform. Their customers (mostly mid-sized 3PLs) push EDI-style order documents in via SFTP, an HTTPS endpoint, and increasingly via a partner API. A small worker process picks those up, validates them, enriches them with carrier metadata, and pushes them onto Redis as jobs for the routing engine.

We had built the intake worker for them the previous October. The brief was simple: keep the inbox empty. Whenever a document lands on Redis under orders:incoming, pop it, validate, enrich, push to orders:routed. Their volume at the time was about 1,200 documents per day, peaking around 09:00 Amsterdam and again at 14:00. Average idle time between jobs: roughly 45 seconds.

That number is the one that mattered, and we'll come back to it.

The 200ms decision

The worker was written in Node.js. The original author (one of our engineers, sitting in Chiang Mai at the time) made a defensible call: use a polling loop with a 200ms tick, because the routing engine needed jobs to feel "instant" during demos. The relevant code looked like this:

while (running) {
  const job = await redis.lpop('orders:incoming');
  if (job) {
    await process(job);
    continue;
  }
  await sleep(200);
}

In October, with one worker and 1,200 jobs/day, this was fine. The Redis instance was on a starter plan. Cost was invisible.

In November, the company signed a partner deal that roughly doubled their daily volume. Sensible. In January, they spun up a staging environment that mirrored production, and someone forgot to lower the worker concurrency there. In February, the routing team added two more worker instances "for headroom." By March, a fifth worker had been added by a contractor who never told anyone.

Each worker, idle, was doing roughly five LPOP calls per second. Five workers, four environments (prod, staging, two preview branches), 20 worker processes in total. That is 100 LPOP calls per second, idle. 8.64 million calls per day. About 260 million calls per month. Against a managed Redis instance billed partly on operations and partly on egress, with metrics shipped to the provider's observability tier (also billed per event), the cost had multiplied with every added worker and environment without anyone noticing.
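The arithmetic behind those figures can be checked in a few lines. This is a rough sketch that assumes each idle iteration takes exactly 200ms (ignoring the Redis round trip) and a 30-day month; the function name is ours, not part of the original worker.

```javascript
// Idle polling volume for a fleet of workers that sleep a fixed interval
// between empty pops. Assumes the round trip to Redis is negligible.
function idlePollVolume({ workers, sleepMs, daysPerMonth = 30 }) {
  const perWorkerPerSecond = 1000 / sleepMs;
  const perSecond = workers * perWorkerPerSecond;
  const perDay = perSecond * 86_400; // seconds in a day
  const perMonth = perDay * daysPerMonth;
  return { perSecond, perDay, perMonth };
}

// 5 workers in each of 4 environments, polling every 200ms.
const volume = idlePollVolume({ workers: 5 * 4, sleepMs: 200 });
console.log(volume);
// { perSecond: 100, perDay: 8640000, perMonth: 259200000 }
```

Note that job volume does not appear anywhere in the calculation: the idle cost depends only on the number of worker processes and the tick length.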

Reading the actual cost driver

When we pulled the billing breakdown, the surprise was where the money went. The Redis line item was only €620 of the €4,100. Postgres was €310. The real damage was in three places nobody had looked.

The first was their managed observability tier. Every Redis command was being shipped as a metric event because someone had enabled "verbose mode" during an incident response in February and never turned it off. That alone was €1,400 per month.

The second was egress between the workers (in eu-west-1) and the Redis instance (in eu-central-1). A region mismatch on a tight poll loop is a slow tax that compounds. €890 per month.

The third was the autoscaler. Their Postgres connection pool was sized to absorb spikes, and the polling workers each held an idle connection. The provider's billing model charged for "active connection hours" above a baseline. €560 per month, billed as a separate line item that nobody read because it had been €40 in October.

The actual database load was almost nothing. The cost was mostly in the plumbing.
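Putting the quoted line items side by side makes the split concrete. The snippet below only restates figures from this post; the grouping into "datastore" and "plumbing" is our own labelling for illustration.

```javascript
// Monthly line items quoted in this post, in euros.
const lineItems = [
  { name: 'Redis', eur: 620, kind: 'datastore' },
  { name: 'Postgres', eur: 310, kind: 'datastore' },
  { name: 'Observability (verbose metrics)', eur: 1400, kind: 'plumbing' },
  { name: 'Cross-region egress', eur: 890, kind: 'plumbing' },
  { name: 'Connection hours', eur: 560, kind: 'plumbing' },
];

// Sums the line items that carry the given label.
function totalFor(kind) {
  return lineItems
    .filter((item) => item.kind === kind)
    .reduce((sum, item) => sum + item.eur, 0);
}

const datastore = totalFor('datastore'); // 930
const plumbing = totalFor('plumbing'); // 2850
console.log({ datastore, plumbing, ratio: (plumbing / datastore).toFixed(1) });
```

The three plumbing rows come to roughly three times the two datastore rows. These five rows total €3,780; the rest of the €4,100 is not itemised here.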

The three-line fix

The fix was not architectural. It did not require a new queueing system, a refactor, or a meeting. It was three lines:

while (running) {
  const job = await redis.brpop('orders:incoming', 30);
  if (!job) continue;
  await process(job[1]);
}

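BRPOP is a blocking pop. It waits up to 30 seconds for a job to land, then returns. If nothing arrives, it returns nil (which Node.js clients surface as null) and we loop. One caveat: BRPOP pops from the tail of the list, whereas LPOP pops from the head, so if your producers rely on head-first ordering, BLPOP is the drop-in equivalent. While it waits, the Redis instance does essentially no work for that client. The Node.js process holds one connection open, idle, and consumes effectively zero CPU. The semantics are documented in the Redis BRPOP reference, and they have not changed in a decade.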
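For readers who want to try the pattern without a Redis server, here is a slightly fuller sketch of the same blocking loop with the client injected. It assumes a client whose `brpop(key, timeoutSeconds)` resolves to `[key, value]`, or `null` on timeout, which is how ioredis behaves; `runIntakeWorker`, `handleJob`, `shouldRun` and `fakeClient` are hypothetical names, not the production code.

```javascript
// A blocking intake loop. `client` is any object exposing
// brpop(key, timeoutSeconds) -> Promise<[key, value] | null>.
async function runIntakeWorker({ client, handleJob, shouldRun, timeoutSeconds = 30 }) {
  let handled = 0;
  while (shouldRun()) {
    // Blocks server-side until a job arrives or the timeout elapses.
    const reply = await client.brpop('orders:incoming', timeoutSeconds);
    if (reply === null) continue; // timed out with nothing to do; wait again
    const [, payload] = reply; // reply is [listName, value]
    try {
      await handleJob(payload);
      handled += 1;
    } catch (err) {
      // A failing job should not kill the loop; real code would also
      // park the payload somewhere for inspection.
      console.error('job failed:', err.message);
    }
  }
  return handled;
}

// In-memory stand-in for Redis, for demonstration only. It never blocks;
// it just pops from the tail like BRPOP, or reports a timeout.
function fakeClient(jobs) {
  return {
    async brpop(key) {
      return jobs.length > 0 ? [key, jobs.pop()] : null;
    },
  };
}
```

Passing `shouldRun` as a function gives the loop a clean exit point on shutdown: with a 30-second timeout, a worker notices a stop request within at most one blocking call.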

After we deployed this to all environments on a Tuesday afternoon, idle pop traffic dropped from 100 ops/sec to about 0.7 ops/sec across the fleet (each of the 20 workers issuing one BRPOP that returns empty every 30 seconds, or roughly 0.03 ops/sec per worker). The verbose-metrics line item evaporated overnight. The egress line item dropped 94% the following week. The autoscaler's connection-hour count fell below the billing threshold within a billing cycle.

The first bill after the fix was €312. The projection for the month after is €290.

Warning

If you have a worker doing tight polling against a managed queue (Redis, SQS, RabbitMQ, Postgres FOR UPDATE SKIP LOCKED), the cost driver is almost never the database itself. It is the observability tier, the egress, and the connection-hour line you have never read.

What the dashboard missed

The strange part of this story is that the team had monitoring. They had Grafana, they had Sentry, they had a Slack alert for "Redis latency > 50ms." None of those fired. The queue was healthy. The workers were healthy. Latency was, if anything, abnormally good, because the workers were so eager to pop that no job ever sat in the queue for more than a few milliseconds.

The thing that wasn't measured was idle work. There is no dashboard widget called "ops per second the system is doing while doing nothing useful." There is a billing dashboard, but billing dashboards are read by finance, and finance reads them once a month after the invoice lands.
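One way to make idle work visible is to count empty pops separately from useful ones and export the ratio alongside latency. The class below is a hypothetical sketch, not something the team ran; the names are ours, and the sample numbers reuse the 45-second idle gap and the five-polls-per-second rate from earlier in this post.

```javascript
// Counts queue operations and how many of them returned nothing.
class IdleWorkMeter {
  constructor() {
    this.totalOps = 0;
    this.emptyOps = 0;
  }

  // Call once per pop attempt; `gotJob` is true when a job was returned.
  record(gotJob) {
    this.totalOps += 1;
    if (!gotJob) this.emptyOps += 1;
  }

  // Fraction of operations that did no useful work (0 when nothing recorded).
  idleRatio() {
    return this.totalOps === 0 ? 0 : this.emptyOps / this.totalOps;
  }
}

// One job found in a window of 225 polls (45 seconds at 5 polls per second).
const meter = new IdleWorkMeter();
for (let i = 0; i < 225; i += 1) meter.record(i === 224);
console.log(meter.idleRatio().toFixed(3)); // 0.996
```

A ratio that sits near 1.0 for a healthy, low-latency worker is exactly the signal the latency alerts could not give.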

The lag between "we made the bad decision" (October) and "we noticed the bad decision" (May) was seven months. That is roughly the half-life of "we'll look at it next sprint" in a small team.

A runbook for any worker that touches a queue

After this, we wrote a short runbook for ourselves, and started checking it into every project that ships an automation worker. The principle is simple: if there isn't a written contract for how the worker should behave when idle, the worker will eventually behave badly and nobody will know whose decision it was.

The runbook is short:

  • Prefer blocking primitives over polling. BRPOP, BLPOP, XREAD BLOCK, Postgres LISTEN/NOTIFY, SQS long-polling, RabbitMQ basic.consume. Read your queue's documentation and find the blocking call.
  • If you must poll, back off exponentially. Start at 100ms, double on empty, cap at 30 seconds. Reset on a successful pop.
  • One environment, one config. The "we'll just copy production" approach to staging is how five-worker fleets become twenty-worker fleets.
  • Bill alerts at the line-item level, not the total. A 3x jump on one row catches this in week one, not month seven.
  • Treat the observability tier as billed code. Metrics are events. Events are money. Verbose-mode toggles need an expiry date.
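
The backoff rule in the second bullet can be sketched as follows. This is a minimal illustration using the numbers from the runbook (100ms start, doubling, 30-second cap); `nextDelay` and `pollWithBackoff` are hypothetical helpers, and `pop` stands for whatever non-blocking read your queue offers.

```javascript
const BASE_DELAY_MS = 100;
const MAX_DELAY_MS = 30_000;

// Delay to use after another empty poll: double, but never exceed the cap.
function nextDelay(currentMs) {
  return Math.min(currentMs * 2, MAX_DELAY_MS);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Polls `pop` until `shouldRun` returns false. `pop` resolves to a job or null.
async function pollWithBackoff({ pop, handleJob, shouldRun, wait = sleep }) {
  let delay = BASE_DELAY_MS;
  while (shouldRun()) {
    const job = await pop();
    if (job !== null) {
      await handleJob(job);
      delay = BASE_DELAY_MS; // reset on a successful pop
      continue;
    }
    await wait(delay);
    delay = nextDelay(delay);
  }
}

// Delays seen across ten consecutive empty polls.
const delays = [];
for (let d = BASE_DELAY_MS, i = 0; i < 10; i += 1, d = nextDelay(d)) delays.push(d);
console.log(delays.join(', '));
// 100, 200, 400, 800, 1600, 3200, 6400, 12800, 25600, 30000
```

The trade-off is latency after a quiet spell: once the delay has reached the cap, the first job to arrive can wait up to 30 seconds, which is why the blocking primitive in the first bullet is preferable where it exists.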

None of this is novel. None of it is surprising once you have been bitten. But it is, in our experience, the single most common cause of "our cloud bill quietly exploded" in small SaaS teams that ship automation workers.

The smallest thing you could do today

When we rebuilt the intake worker for the Utrecht team (a process automation job that started as an inbox-zero brief and grew into a small fleet of agents), the uncomfortable discovery was that the original 200ms decision had looked sensible in every code review it passed. Our answer was to write the runbook above and pin it next to the worker source, so the next engineer cannot miss it.

The five-minute audit, if you want to run it today: open your cloud billing dashboard, filter to the last 90 days, and sort line items by largest absolute change. The answer is in the first three rows. It is almost certainly not the row you expect.
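If your provider lets you export line items, the same audit can be scripted. The sketch below assumes you already have two plain objects mapping line-item names to amounts for the start and end of the window; `rankByAbsoluteChange` is a hypothetical helper and the sample figures are invented for illustration.

```javascript
// Ranks billing line items by the absolute change between two periods.
// `before` and `after` map line-item name -> amount for the period.
function rankByAbsoluteChange(before, after) {
  const names = new Set([...Object.keys(before), ...Object.keys(after)]);
  return [...names]
    .map((name) => {
      const from = before[name] ?? 0; // item did not exist yet
      const to = after[name] ?? 0; // item has disappeared
      return { name, from, to, change: to - from };
    })
    .sort((a, b) => Math.abs(b.change) - Math.abs(a.change));
}

// Invented sample figures, for illustration only.
const ninetyDaysAgo = { compute: 900, database: 300, 'metrics-ingest': 60, egress: 45 };
const latest = { compute: 950, database: 310, 'metrics-ingest': 1400, egress: 890 };

const ranked = rankByAbsoluteChange(ninetyDaysAgo, latest);
console.log(ranked.slice(0, 3).map((row) => `${row.name}: ${row.change >= 0 ? '+' : ''}${row.change}`));
// [ 'metrics-ingest: +1340', 'egress: +845', 'compute: +50' ]
```

Sorting by absolute rather than percentage change keeps tiny rows that merely doubled from crowding out the rows that actually moved the total.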

Key takeaway

A tight polling loop will not blow up your database. It will blow up your observability bill, your egress, and your connection-hours, and it will do it quietly for months.

FAQ

Why does polling every 200ms cost so much when the database itself barely notices?

The cost compounds in observability events (each command shipped as a metric), cross-region egress, and per-connection-hour billing. The database load is often the smallest line item.

When should I use BRPOP instead of LPOP in a polling loop?

For any worker reading a Redis list as a queue with idle periods. BRPOP blocks server-side until a job arrives or the timeout expires, so neither the client nor Redis does meaningful work while waiting.

How do I catch this kind of bill creep earlier?

Set per-line-item alerts on your cloud bill, not just a total threshold. A 3x jump on one row catches the problem in the first billing cycle; a 20% jump on the total takes months to look suspicious.
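A per-row check of this kind is only a few lines if you can export two consecutive bills. The sketch below is an assumption-laden illustration: `flagLineItemJumps` is a hypothetical helper, the `minAmount` noise floor is our addition, and the sample figures are invented.

```javascript
// Flags line items whose cost grew by at least `factor` between two periods.
// Items below `minAmount` in the later period are ignored to avoid noise.
function flagLineItemJumps(previous, current, { factor = 3, minAmount = 50 } = {}) {
  return Object.entries(current)
    .filter(([name, amount]) => {
      const prior = previous[name];
      if (amount < minAmount) return false;
      if (prior === undefined || prior === 0) return true; // new line item
      return amount / prior >= factor;
    })
    .map(([name]) => name);
}

// Invented sample figures: the total rises about 18%, one row rises 4x.
const lastMonth = { compute: 900, database: 300, 'connection-hours': 40 };
const thisMonth = { compute: 1000, database: 310, 'connection-hours': 160 };
console.log(flagLineItemJumps(lastMonth, thisMonth)); // [ 'connection-hours' ]
```

In this sample a 20% threshold on the total would stay silent, while the row-level check fires in the first cycle.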

Is the same risk present with SQS or RabbitMQ?

Yes. SQS without long polling racks up empty-receive charges fast. RabbitMQ has a different billing model but a tight polling consumer still wastes connections and metrics. Use the queue's blocking primitive.
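For SQS, long polling is a request parameter rather than a different call. The fragment below shows the relevant `ReceiveMessage` parameters as a plain object; the queue URL is a placeholder, and wiring the object into an AWS SDK client is deliberately left out.

```javascript
// Parameters for an SQS ReceiveMessage request with long polling enabled.
const receiveParams = {
  // Placeholder URL; substitute your own queue.
  QueueUrl: 'https://sqs.eu-west-1.amazonaws.com/123456789012/example-queue',
  WaitTimeSeconds: 20, // long poll: wait up to 20s (the SQS maximum) for a message
  MaxNumberOfMessages: 10, // fetch up to 10 messages per request (the SQS maximum)
};
```

With `WaitTimeSeconds` at 0 the call returns immediately, which is the SQS equivalent of the 200ms LPOP loop.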

process automation · automation · operations · case study · architecture · tooling
