Databases

Postgres audit checklist: what we run before any agent

It is 23:00 in Bangkok. The agent has been running for seven hours and the ops dashboard looks great. The database is at 100% CPU. Three pilots have ended like this.

Jacob Molkenboer· Founder · A Brand New Company· 17 Nov 2024· 8 min

Open wooden index-card drawer with brass divider, chartreuse tag, and clipped ledger paper on ivory surface.

It is 23:00 in Bangkok. The agent we deployed at 16:00 has been running for seven hours. The client's ops dashboard shows 18,400 invoice records processed, which is great. The client's primary Postgres is at 100% CPU, the read replica is 41 minutes behind, and an account manager has just been paged because the customer portal is timing out. That is not great.

The agent was doing exactly what we told it to do. The database was not built for what the agent was doing.

We have shipped fourteen agents into production. Three of them crashed the underlying Postgres in their first week. Every time it was one of the same three things: a missing index on a column the agent suddenly cared about, autovacuum starvation on the table the agent was now writing to constantly, or a jsonb column that nobody had ever measured. Now we run a checklist before any agent gets a connection string.

What the three pilots had in common

Before the agent shows up, Postgres workloads usually have a human-shaped traffic pattern. A few hundred queries per minute. Most reads come from one or two warm pages. Writes are batched by a nightly cron. The database has been "fine" for years.

An automation agent has a different shape. It reads a table the human dashboard never touched. It writes 200 rows a minute instead of 5,000 rows once a night. It checks the same status column every loop and never gives up. It is not slower than a human team. It is also not nicer.

Takeaway

An automation agent is a very patient junior developer with infinite hands. If your Postgres has a soft spot, the agent will find it inside an hour.

Sequential scans on the work table

The first failure mode is always the same. The agent polls a "work to do" table with a query like WHERE status = 'pending' AND assigned_at IS NULL. The table has 400,000 rows. There is an index on id and nothing else. Every poll is a sequential scan. The agent polls every two seconds. The database is now reading the whole table 30 times a minute.

You can find this in 10 seconds.

SELECT schemaname,
       relname,
       seq_scan,
       idx_scan,
       n_live_tup,
       pg_size_pretty(pg_relation_size(relid)) AS size
FROM pg_stat_user_tables
WHERE n_live_tup > 10000
ORDER BY seq_scan DESC
LIMIT 20;

Any row in the top 20 where seq_scan is materially larger than idx_scan, on a table bigger than a few thousand rows, is a candidate. We then check which columns the agent will filter on and add a partial index that matches the actual query shape:

CREATE INDEX CONCURRENTLY invoices_pending_idx
  ON invoices (assigned_at)
  WHERE status = 'pending';

The CONCURRENTLY matters. The first time we forgot it on a 12 GB table the customer portal froze for nine minutes while the index built under an exclusive lock. The Postgres docs on CREATE INDEX are blunt about the cost of the non-concurrent variant.

Autovacuum that never finishes

The second failure mode is slower and meaner. The agent is now writing 200 rows a minute to a table that used to see one nightly batch. Every update creates a dead tuple. Autovacuum has default settings, which means it kicks in after 20% of the table is dead. That is a reasonable threshold for a table with one batch per night. It is not reasonable when the agent rewrites the same row eight times during a single workflow.

Run this:

SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(n_dead_tup * 100.0 / NULLIF(n_live_tup, 0), 1) AS pct_dead,
       last_autovacuum,
       last_autoanalyze
FROM pg_stat_user_tables
WHERE n_live_tup > 1000
ORDER BY n_dead_tup DESC
LIMIT 20;

Anything where pct_dead is above 20% and last_autovacuum is older than a few hours is autovacuum starvation. Either the table never reaches the threshold, or it reaches it constantly and autovacuum cannot keep up with the writes.

The fix is per-table tuning, not global. Lower the scale factor for the tables the agent actually touches:

ALTER TABLE invoices SET (
  autovacuum_vacuum_scale_factor = 0.02,
  autovacuum_analyze_scale_factor = 0.01
);

The official routine vacuuming chapter covers the math and the trade-offs in more detail than fits here. The thing to internalise: the defaults are tuned for tables a human writes to. Agents are not human.

The jsonb column nobody measured

Number three is the one that scares us the most because it is invisible. The schema has a column called payload jsonb or metadata jsonb or raw jsonb. It was added two years ago to "store the rest of the webhook for later". The dashboard never reads it. The application never writes more than 4 KB into it. Nobody knows what is in there.

Then the agent shows up. It pipes the entire model response into raw as an audit trail. Each row is now 80 KB. TOAST is doing 90% of the work the database is doing. Replication starts lagging because every WAL record carries the whole jsonb. The application server is fine. The dashboard is fine. The numbers on the board look great. The disk is on fire.

Find them before the agent does:

SELECT relname,
       pg_size_pretty(pg_relation_size(relid))         AS table_size,
       pg_size_pretty(pg_total_relation_size(relid)
                      - pg_relation_size(relid))       AS toast_and_indexes,
       pg_size_pretty(pg_total_relation_size(relid))   AS total
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 15;

When the toast_and_indexes column is larger than the table itself, you have a jsonb column doing work nobody asked for. The right answer is almost never "make the jsonb smaller". It is "stop writing the whole webhook to it". Move the audit trail to object storage, store a pointer in the row, and let Postgres go back to being a database.

Connection pool reality

The fourth item on the checklist is the connection count. The agent framework documentation tells you to set max_connections to whatever your agent needs. The Postgres documentation tells you that every connection costs roughly 10 MB of RAM and a process. Both are correct. Neither tells you that the application server, the read replicas, the analytics jobs, the migration tool, and the four humans on pgAdmin are also holding connections.

SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state
ORDER BY count(*) DESC;

If idle in transaction is anything other than near zero, the agent has a transaction-management bug and you should fix it before going live. If active is consistently near max_connections, you need PgBouncer in transaction mode in front of the agent, not a higher max_connections setting. Raising the cap on a database that is already maxed out at the OS level turns a queue into a melt.

Lock waits no human ever sees

The last item is the one that only shows up under agent traffic. Humans serialise themselves: one person clicks "process invoice", then waits, then clicks the next one. Agents do not. Two workers picking up the same row at the same time will deadlock cleanly if you used SELECT FOR UPDATE SKIP LOCKED, and will block each other indefinitely if you did not.

SELECT blocked.pid       AS blocked_pid,
       blocked.query     AS blocked_query,
       blocking.pid      AS blocking_pid,
       blocking.query    AS blocking_query,
       now() - blocked.query_start AS blocked_for
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE blocked.wait_event_type = 'Lock'
ORDER BY blocked_for DESC;

Run that under load for an hour. If you see the same query showing up on both sides of the join, your agent's job-claim logic is wrong. The pattern that works in production:

SELECT id
FROM invoices
WHERE status = 'pending'
ORDER BY created_at
FOR UPDATE SKIP LOCKED
LIMIT 10;

This is also a good moment to remember that Postgres has historically refused to support query hints on the principle that the planner should be trusted. That argument is being relitigated in public on the hackers list ahead of Postgres 19, and the discussion is worth following if you care about planner stability. Whatever lands there, do not bet a production agent on a hint that does not exist yet. Get the indexes right.

The five-minute audit, in order

The checklist we run on every Postgres before an agent gets a connection string:

Sequential scans on big tables. Anything over a few thousand rows that is being scanned more often than indexed.
Dead tuple percentage and last autovacuum timestamp. Anything dead, hot, and unvisited gets per-table tuning.
TOAST size per table. Anything where TOAST is larger than the heap gets a conversation about what is in the jsonb.
Connection state breakdown. Idle-in-transaction is a code bug. Saturated active is a pooler problem.
Lock waits under simulated agent load. Run the agent against staging for an hour and watch pg_blocking_pids.

None of this is exotic. All of it gets skipped because the database has been fine for years and the agent demo went well on Tuesday.

When we built the invoice-chasing agent for a Dutch building-services client earlier this year, the audit caught a jsonb column that would have grown by 4 GB a day. We moved the audit trail to S3 inside a week and the agent has been running quietly since. The longer version of how we wire AI agents into existing databases without breaking them is on the work page.

The smallest thing you can do today: open psql against your production database, paste in the first query above, and look at the top 20 rows. If anything there surprises you, you are not ready for an agent yet.

Key takeaway

An automation agent is a patient junior developer with infinite hands. If your Postgres has a soft spot, it will find it inside an hour.

FAQ

How long does this audit take on a typical client database?

About 30 minutes against a read replica. The five queries are read-only and cheap. Reading the results and deciding what to fix takes longer than running them.

Do we need pg_stat_statements installed?

Not for this audit. Every view used here ships with vanilla Postgres. pg_stat_statements is useful for the next layer of tuning, but you do not need it to catch the three failure modes in this post.

Does the checklist work on RDS or Cloud SQL?

Yes. The queries are identical. The autovacuum tuning uses parameter groups instead of postgresql.conf, but the per-table ALTER TABLE SET works the same way.

What about Supabase or Neon?

Same queries, same fixes. On serverless Postgres the connection-pool item matters more, not less, because cold connections are expensive. Run the pg_stat_activity check after a real load test.

process automationai agentsarchitectureoperationstoolingworkflow

Building something?

Start a project