Postgres queues over HubSpot workflows: a staffing rebuild

On a Tuesday in February the recruiting ops lead at a Rotterdam staffing agency watched her HubSpot workflows queue 4,200 tasks and stall. We helped her replace them.

Jacob Molkenboer · Founder · A Brand New Company · 26 Apr 2024 · 9 min

It was a Tuesday in February. Anouk, the recruiting ops lead at a staffing agency on Rotterdam's Westblaak, was staring at her HubSpot workflows dashboard. The "Candidate stage moved to interview" workflow had 4,217 enrolments queued. The next workflow in the chain, "Notify account manager", had been in a "starting" state for thirty-eight minutes. Three recruiters on her floor were waiting for the candidate email to fire so they could see who had been told what. It would fire, eventually, in batches, in an order no one could reconstruct from the UI.

She wrote to us that evening. Six weeks later, the 14 workflows that ran her placement pipeline were gone, replaced by a 200-line Python worker reading from a Postgres queue. This post covers what we built, why HubSpot stopped fitting, and the one Postgres feature that makes the whole thing work.

What the workflows actually did

The agency places contractors into roles at mid-market companies in Zuid-Holland. About 60 internal staff. Their HubSpot pipeline was the usual: candidate sourced, intro call, client interview, offer, placement, invoice, contract renewal. Each stage change triggered a workflow chain:

Send the candidate a stage-specific email.
Create a HubSpot task for the recruiter who owns the candidate.
Post a message in the Slack channel for the client account.
Append a row to the master placement Google Sheet (Finance reads from it).
Bump the candidate's LinkedIn outreach status in a separate tool.
If the stage is "placement", fire a webhook to Exact Online to draft an invoice.

Fourteen workflows, around 80 branches, custom code actions in six of them. None of it was exotic. All of it broke under load.

Where HubSpot's workflow engine starts to fight you

HubSpot workflows are excellent for what they were designed to do: marketing-flavoured drip sequences over CRM records, with conditional branches a non-technical person can edit. The engine starts to fight you when ops logic gets real.

You can't read state from another workflow mid-run. Workflows don't share scratch space. If "Notify account manager" needs to know what "Send candidate email" decided, you encode that decision back onto the contact record as a custom property and pray nothing else writes to it. The agency had nine custom properties that existed solely to pass state between workflows. None of them had a meaningful name to a recruiter.

You can't see why a branch was taken. The workflow log shows that a contact went down branch A. It does not show the value of the condition that made that decision. When the wrong email goes out, you guess.

Custom code actions have rate limits. HubSpot's documented limits on serverless action execution are the kind of numbers that don't matter until they do. The agency hit them on Monday mornings when 200 stage changes from weekend client interviews entered the system at once.

You can't deploy a workflow change with git. Every edit is a click-and-save in the UI. Reviewing a change means screen-sharing. Rolling back means trying to remember what the branch looked like yesterday.

The cost line keeps moving. Operations Hub Professional sits at roughly €720 per month for the tier that allows custom code actions. The agency was paying extra for overage on action executions, and that line was growing.

Warning

If you have more than five custom code actions doing real branching logic across your workflows, you have a queue and a worker. You just haven't named them yet. HubSpot is now the GUI on top of an undocumented service you're paying to host.

The Postgres queue, in shape

The replacement is unglamorous on purpose. One table, one worker, one cron, plus the existing HubSpot webhook.

CREATE TABLE job (
  id          bigserial PRIMARY KEY,
  kind        text NOT NULL,
  payload     jsonb NOT NULL,
  run_at      timestamptz NOT NULL DEFAULT now(),
  attempts    int NOT NULL DEFAULT 0,
  state       text NOT NULL DEFAULT 'ready',  -- ready | running | done | dead
  started_at  timestamptz,                    -- set on claim, read by the reaper
  last_error  text,
  dedupe_key  text,
  created_at  timestamptz NOT NULL DEFAULT now(),
  UNIQUE (kind, dedupe_key)
);

CREATE INDEX job_ready_idx
  ON job (run_at)
  WHERE state = 'ready';

HubSpot fires its existing webhook when a deal or contact stage changes. A small Flask endpoint reads the webhook and inserts one row per side-effect: one for the email, one for the Slack ping, one for the spreadsheet row, one for the Exact Online draft. The dedupe_key is usually contact_id:stage:iso_week, which the UNIQUE constraint turns into idempotency almost for free. A retried webhook produces the same key, and as long as the insert is written with ON CONFLICT (kind, dedupe_key) DO NOTHING, the second insert silently no-ops. A plain INSERT would raise a unique-violation error instead.
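A minimal sketch of that fan-out, to make the dedupe behaviour concrete. This is not the agency's receiver: the SIDE_EFFECTS mapping, the job kind names, the event fields, and the exact key format (ISO year plus week) are all assumptions for illustration. The cursor is anything with a psycopg-style execute method.

```python
import json
from datetime import date

# Hypothetical mapping from pipeline stage to the job kinds it fans out to.
SIDE_EFFECTS = {
    "interview": ["candidate_email", "slack_ping", "sheet_row"],
    "placement": ["candidate_email", "slack_ping", "sheet_row", "exact_invoice_draft"],
}

# ON CONFLICT DO NOTHING is what turns a duplicate key into a silent no-op.
INSERT_SQL = """
INSERT INTO job (kind, payload, dedupe_key)
VALUES (%s, %s::jsonb, %s)
ON CONFLICT (kind, dedupe_key) DO NOTHING
"""


def dedupe_key(contact_id, stage, day):
    """Build contact_id:stage:iso_week; including the ISO year is an assumption."""
    iso_year, iso_week, _ = day.isocalendar()
    return f"{contact_id}:{stage}:{iso_year}-W{iso_week:02d}"


def jobs_for(event, day):
    """Return one (kind, payload_json, dedupe_key) row per side-effect of the stage."""
    key = dedupe_key(event["contact_id"], event["stage"], day)
    payload = json.dumps(event)
    return [(kind, payload, key) for kind in SIDE_EFFECTS.get(event["stage"], [])]


def enqueue(cur, event, day):
    """Insert every side-effect row; returns how many inserts were attempted."""
    rows = jobs_for(event, day)
    for row in rows:
        cur.execute(INSERT_SQL, row)
    return len(rows)
```

Because the key depends only on the contact, the stage and the week, replaying the same webhook twice produces identical rows, and the database discards the second set.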

The worker is a Python loop that claims one job at a time and runs the matching handler. The handlers are 20 to 40 lines each: HTTP POST to Slack, HubSpot API call to create a task, gspread append, Exact Online OAuth call.
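For a sense of what a handler module can look like, here is a hedged sketch of a registry with one Slack handler. The worker only needs HANDLERS to map a job kind to a callable that takes the payload; the decorator, the kind name slack_ping, and the payload fields webhook_url and text are assumptions, not the agency's code.

```python
import json
import urllib.request

# Maps job kind -> handler function. The worker looks handlers up here.
HANDLERS = {}


def handler(kind):
    """Register the decorated function as the handler for one job kind."""
    def register(fn):
        HANDLERS[kind] = fn
        return fn
    return register


def build_slack_request(payload):
    """Build the POST for a Slack incoming webhook (payload fields are hypothetical)."""
    body = json.dumps({"text": payload["text"]}).encode("utf-8")
    return urllib.request.Request(
        payload["webhook_url"],
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


@handler("slack_ping")
def slack_ping(payload):
    # urlopen raises on HTTP errors and timeouts; letting the exception
    # propagate is what lets the worker record the error and schedule a retry.
    with urllib.request.urlopen(build_slack_request(payload), timeout=10) as resp:
        resp.read()
```

Keeping request construction separate from the network call makes each handler testable without touching Slack.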

The one trick: SELECT FOR UPDATE SKIP LOCKED

This is the Postgres feature that makes "queue in your existing database" a real choice instead of a starter-project compromise. It landed in Postgres 9.5 in 2016 and is genuinely all you need. Brandur Leach's writeup on Postgres job queues is the canonical reference if you want the full theory.

WITH next AS (
  SELECT id FROM job
   WHERE state = 'ready' AND run_at <= now()
   ORDER BY run_at
   LIMIT 1
   FOR UPDATE SKIP LOCKED  -- pass over rows another worker has already locked
)
UPDATE job
   SET state = 'running',
       started_at = now(),
       attempts = attempts + 1
  FROM next
 WHERE job.id = next.id
RETURNING job.*;

Three workers can run this concurrently and never fight each other. The CTE takes a row-level lock on the row it picks, and SKIP LOCKED makes every other worker pass over that row instead of waiting for it, so worker B never claims what worker A is holding. Workers that fail mid-job leave their row in running with an attempts bump; a separate reaper query resets anything that's been running too long. There is no Redis. There is no SQS. There is no Celery. There is one table.

The worker loop in full:

import time
import traceback
from pathlib import Path

import psycopg
from psycopg.rows import dict_row

from handlers import HANDLERS

DSN = "postgresql:///agency"
CLAIM_SQL = Path("claim.sql").read_text()  # the claim query shown above

def claim(conn):
    # Claim one ready job and mark it running; returns None when nothing is due.
    with conn.cursor() as cur:
        cur.execute(CLAIM_SQL)
        row = cur.fetchone()
        conn.commit()
        return row

def finish(conn, job_id, ok, err=None):
    # Success: mark the job done. Failure: retry with quadratic backoff,
    # or park the job in 'dead' once it has used five attempts.
    with conn.cursor() as cur:
        if ok:
            cur.execute("UPDATE job SET state='done' WHERE id=%s", (job_id,))
        else:
            cur.execute(
                """UPDATE job
                      SET state = CASE WHEN attempts >= 5 THEN 'dead' ELSE 'ready' END,
                          run_at = now() + (interval '30 seconds' * (attempts * attempts)),
                          last_error = %s
                    WHERE id = %s""",
                (err, job_id),
            )
        conn.commit()

def main():
    with psycopg.connect(DSN, row_factory=dict_row, autocommit=False) as conn:
        while True:
            job = claim(conn)
            if not job:
                time.sleep(2)  # queue is empty; poll again shortly
                continue
            try:
                HANDLERS[job["kind"]](job["payload"])
            except Exception:
                # Handler errors, including an unknown kind, are recorded and retried.
                finish(conn, job["id"], ok=False, err=traceback.format_exc())
            else:
                finish(conn, job["id"], ok=True)

if __name__ == "__main__":
    main()
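To make the retry policy in finish concrete, the sketch below mirrors the SQL arithmetic in plain Python: 30 seconds times the square of the attempt count, with the job going to dead at five attempts. The names retry_delay, MAX_ATTEMPTS and BASE_SECONDS are illustrative and do not appear in the worker.

```python
MAX_ATTEMPTS = 5    # matches "attempts >= 5" in the failure UPDATE
BASE_SECONDS = 30   # matches interval '30 seconds'


def retry_delay(attempts):
    """Seconds to wait after `attempts` failed tries, or None once the job is dead."""
    if attempts >= MAX_ATTEMPTS:
        return None
    return BASE_SECONDS * attempts * attempts


schedule = [retry_delay(n) for n in range(1, MAX_ATTEMPTS + 1)]
print(schedule)  # [30, 120, 270, 480, None]
```

Ignoring handler runtime and the two-second poll, a job that keeps failing spends about 15 minutes (900 seconds) in backoff before it lands in dead.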

A separate one-minute cron handles the crash-recovery case the worker loop deliberately ignores. Anything stuck in running for more than five minutes gets reset, on the assumption that no honest handler runs that long:

UPDATE job
   SET state = 'ready',
       last_error = 'reaped after worker timeout'
 WHERE state = 'running'
   AND started_at < now() - interval '5 minutes';

That is the entire crash-recovery story.

That, plus the handler module, plus the webhook receiver, is the whole system. The retry policy lives in the SQL. The dead-letter behaviour lives in the SQL. Observability is SELECT state, count(*) FROM job GROUP BY state, plotted on a single Grafana panel.

What stayed in HubSpot

This is the part most "we replaced our CRM" stories get wrong. The agency did not replace HubSpot. HubSpot is good at being a CRM: contact records, pipelines, properties, marketing email, the deal board the recruiters live in. Replacing that would have been a year of pain for no gain.

We replaced the orchestration layer. HubSpot still owns the data. The Python worker reads from HubSpot via the API, writes back via the API, and uses Postgres only for queueing the side-effects. From a recruiter's point of view nothing changed, except the emails now fire in seconds instead of "sometime today".

The cutover, one workflow at a time

We did not switch overnight. The cutover ran four weeks in parallel with the old system.

Week one. Webhook receiver up, queue populated, no handlers firing. We watched the queue depth match the workflow enrolment count, give or take a few seconds. That was the silent dry run.

Week two. One handler live: the Slack ping. The matching HubSpot workflow was disabled. If the new code dropped a message, recruiters would notice in minutes and we would hear about it. They did not.

Week three. The rest of the handlers, one per day, paired with disabling the matching workflow. The worst day was the Exact Online invoice draft. Exact's published rate limit is 60 requests per minute, but a single integration's tokens silently throttle at roughly 20 per minute under sustained load. We learned that by watching the dead state fill up at 14:30 on a Wednesday afternoon. We added a per-handler concurrency cap, implemented as a Postgres advisory lock keyed on the handler name, so one Exact job runs at a time regardless of how many are queued behind it. Five lines of SQL, one line of Python.

Week four. The legacy workflows were archived, not deleted. Three months later we deleted them. Nothing had reached for them.
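The advisory-lock cap from week three can be sketched roughly as follows. This is one possible shape under stated assumptions, not the code that shipped: handler_slot and lock_key are hypothetical names, the key is derived by hashing the handler name, and the cursor is assumed to return plain tuples (not the dict rows the worker uses). pg_try_advisory_lock and pg_advisory_unlock are the standard Postgres session-level advisory lock functions.

```python
import hashlib
from contextlib import contextmanager


def lock_key(handler_name):
    """Derive a stable signed 64-bit integer, the key type advisory locks take."""
    digest = hashlib.sha256(handler_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


@contextmanager
def handler_slot(cur, handler_name):
    """Yield True if this session holds the single slot for handler_name, else False."""
    key = lock_key(handler_name)
    # Non-blocking: returns false immediately if another session holds the lock.
    cur.execute("SELECT pg_try_advisory_lock(%s::bigint)", (key,))
    acquired = bool(cur.fetchone()[0])
    try:
        yield acquired
    finally:
        if acquired:
            cur.execute("SELECT pg_advisory_unlock(%s::bigint)", (key,))
```

A worker would wrap the handler call in `with handler_slot(cur, "exact_invoice_draft") as ok:` and, when ok is False, put the job back with a short delay instead of running it. Session-level locks are also released when the connection closes, so a crashed worker cannot hold the slot forever.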

Six months in

Numbers from the agency's first six months on the new system, against the same months a year earlier:

14 workflows reduced to 7 job kinds. The seven kinds map one-to-one to side-effects we can name in a sentence.
Average notification latency, stage change to Slack ping: from roughly 9 minutes to about 3 seconds.
Custom code action overage: gone. Operations Hub Professional stays, the overage line went to zero.
One production incident in six months: a typo in a Slack channel name. The job retried, failed five times, landed in dead, and Anouk fixed it in three minutes.
Lines of Python: 211. Lines of SQL: 38. Lines of YAML: 0.

The thing Anouk says most often when she shows the system to other ops leads is "I can read it." That is the actual win. The workflow editor was unreadable: 80 branches in a graph that did not fit on a monitor. The Python file fits on a monitor.

One thing the numbers do not capture: the change in how Anouk talks to her CTO. Before, when ops needed a workflow tweaked, she filed a ticket and waited a week. Now she opens a pull request against the handlers file, gets a review by lunch, and ships the same afternoon. Ops moved from being a vendor-managed line item to being a piece of software the team owns and can read.

When this is the wrong call

If your "workflows" are five-step marketing drips with no branching, do not do this. HubSpot is the right tool. If your team has zero people who can read Python at 2am, do not do this either. You would be trading a vendor problem for a staffing problem. The break-even point we have seen, after this kind of project at four mid-market clients, is somewhere around six custom code actions and three workflows that branch on each other. Below that, stay in HubSpot. Above that, the queue starts paying for itself within a quarter.

When we built the placement-automation worker for the Rotterdam agency, the thing that bit us hardest was idempotency on the HubSpot write-back. The same job firing twice would create duplicate recruiter tasks, recruiters would see them, get confused, and ping ops. We solved it by making the dedupe_key on the job table do double duty as the idempotency token on the HubSpot side, sent in the request body so any retry collapses to the same write. This kind of process automation is most of what our studio does these days.

The smallest thing you can do today, if you are staring at a HubSpot workflows screen that feels heavier every month, is open the editor, count the custom code actions, and check your last three monthly invoices for action-overage line items. That count and that number are the size of the problem.

Key takeaway

If you have six custom code actions doing real branching in HubSpot workflows, you already have a job queue. You just have not named it yet.

FAQ

Do we need to move off HubSpot entirely to do this?

No. HubSpot still owns the CRM records, properties and pipelines. You replace only the orchestration layer that fires side-effects. The recruiter UI is unchanged.

How big does the team need to be before this pays off?

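Team size matters less than workflow complexity. The break-even is roughly six custom code actions and three workflows that branch on each other's state, plus at least one person on the team who can read Python. Below that, HubSpot workflows are still the right tool. Above that, a Python worker starts paying for itself within a quarter.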

Why Postgres instead of SQS, Celery or a hosted queue?

One fewer system to host, one fewer set of credentials, one fewer place to look when something breaks. SELECT FOR UPDATE SKIP LOCKED handles the locking that justified a separate queue ten years ago.

What happens to a job that fails permanently?

After five attempts it moves to the dead state and stays there. A daily report lists everything in dead. You fix the handler or the data, then retry by hand with a one-line UPDATE.
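As a rough illustration of that report and manual retry, the two statements might look like the sketch below. The column choices and the decision to reset attempts to zero are assumptions, and requeue_dead is a hypothetical helper taking any psycopg-style cursor.

```python
# Lists everything parked in 'dead', oldest first, for the daily report.
DEAD_REPORT_SQL = """
SELECT id, kind, attempts, last_error
  FROM job
 WHERE state = 'dead'
 ORDER BY created_at
"""

# Puts one dead job back in the queue with a fresh retry budget (assumption).
REQUEUE_SQL = """
UPDATE job
   SET state = 'ready', attempts = 0, run_at = now(), last_error = NULL
 WHERE id = %s AND state = 'dead'
"""


def requeue_dead(cur, job_id):
    """Requeue one dead job; returns True if exactly one row was updated."""
    cur.execute(REQUEUE_SQL, (job_id,))
    return cur.rowcount == 1
```

The state = 'dead' guard means a mistyped id pointing at a running or finished job changes nothing.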
