← Blog

Databases

Postgres 12 audit checklist: what we run before AI retrofits

A Friday afternoon, a read-replica granted, a Monday deadline. Here is the boring audit checklist we run before we will quote AI-agent work on a Dutch SME's Postgres 12.

Jacob Molkenboer· Founder · A Brand New Company· 15 Dec 2025· 9 min
Half-open oak index-card drawer with cream cards, brass divider, green tab, leather logbook, paperclip on ivory linen.

A Friday afternoon in Utrecht. The CTO of a 38-person wholesale company has just dropped us into a read-replica via a temporary IP allow-list. He wants to know, before Monday, whether we can wire a customer-service agent into the order database without breaking the nightly batch. We will not quote until we have run the checklist below. It takes about an hour on a healthy cluster and the better part of a day on a sick one, and the answer to can we ship the agent is buried inside it.

Most Dutch SMEs we audit are still on PostgreSQL 12 in 2026. Community support ended on 14 November 2024, but the database is doing its job, the hosting bill is predictable, and nobody had a reason to move. The moment somebody asks for an AI agent on top of it, that calculus changes. The agent is going to read more, write more, and care a great deal about consistent snapshots that the original schema was never tuned for.

Why we audit before we quote

We have quoted retrofits without an audit. Twice. Both times we ate the difference. The pattern was the same: a query the agent fired forty times a minute had no plan in pg_stat_statements, so we missed it during scoping. By week three the replica lag was climbing in business hours and we were debugging at 22:00 on a Tuesday. Free lesson, paid in our hours.

The checklist below is what we ran for the last fourteen SME prospects between January and May. It is deliberately boring. It has no LLMs in it. It just answers three questions: is the autovacuum keeping up on the tables we are about to read from, do we have visibility into what is already running, and can we get this cluster onto a supported major version without losing a single row that the AVG considers an audit-log entry.

Scoring autovacuum lag on tables above 40 GB

Forty gigabytes is the threshold where autovacuum behaviour stops being a default-settings story and starts being a per-table story. Below that, the stock autovacuum_vacuum_scale_factor of 0.2 usually does fine. Above it, a 0.2 scale factor means autovacuum waits until 8 GB of dead tuples have accumulated before it bothers to run, and by then the cleanup is competing with business hours for I/O.

We pull the picture with this:

SELECT
  schemaname || '.' || relname AS table,
  pg_size_pretty(pg_total_relation_size(relid)) AS size,
  n_live_tup,
  n_dead_tup,
  ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_pct,
  last_autovacuum,
  autovacuum_count,
  EXTRACT(EPOCH FROM (now() - last_autovacuum)) / 3600 AS hours_since_vacuum
FROM pg_stat_user_tables
WHERE pg_total_relation_size(relid) > 40 * 1024^3
ORDER BY pg_total_relation_size(relid) DESC;

The score is simple. Green: dead_pct under 8, last autovacuum within 24 hours. Yellow: dead_pct between 8 and 20, or last autovacuum more than 48 hours back. Red: anything above 20% dead tuples on a table the agent will read. Red means we are quoting vacuum-tuning work as a precondition, not a phase-two improvement.

Two patterns show up over and over. The first is a 60–90 GB orders table with the default scale factor and 14% dead tuples. Easy fix: bring the scale factor down to 0.02 and the threshold up to 50,000 with an ALTER TABLE, then let it catch up over a weekend. The second is an audit_log table that has never been vacuumed because nobody ever deletes from it. That one is fine on dead tuples and a disaster on freezing. pg_class.relfrozenxid is the column to watch.

Warning

If age(relfrozenxid) on any table is within 100 million of autovacuum_freeze_max_age, your cluster is one wrong week away from an emergency anti-wraparound vacuum that locks the table. That is not the week you want to be installing an agent.

pg_stat_statements coverage on the top 30 queries

The second question we want answered: when we add agent traffic to this database, will we be able to see it? pg_stat_statements is the cheapest observability you can buy, and on every Dutch SME cluster we audit it is either off, half-installed, or set to a track level that misses the queries we care about.

Two checks. First, is it actually loaded into shared_preload_libraries and created in the database the agent will hit? Second, is pg_stat_statements.max high enough that the queries we care about have not been evicted? Default is 5000, which is fine until your ORM generates eight thousand distinct prepared-statement shapes.

SELECT
  calls,
  ROUND(total_exec_time::numeric, 0) AS total_ms,
  ROUND(mean_exec_time::numeric, 2) AS mean_ms,
  rows,
  LEFT(query, 120) AS query_head
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 30;

We score the top 30 on three axes: do we recognise the query (is it documented somewhere in the codebase, or is it a ghost), does the mean execution time line up with what the developer thinks it does, and does any single query account for more than 15% of total exec time. That last one is the agent-risk flag. A query that already eats 18% of total exec time will eat 30% once the agent starts asking it on the customer's behalf forty times a session.

If pg_stat_statements is not installed, we install it during the audit window and come back in 72 hours to read the result. Quoting before that read is guessing.

The cutover-survival question

This is where the audit gets opinionated. PostgreSQL 12 is out of community support. We can keep it alive on a paid extended-support contract, but we will not build a new agent on a database whose security patches now require a vendor relationship the client does not have. So the third part of the audit asks: can we move this cluster onto 17 without losing an AVG-vereist audit-log row during the switchover?

The mechanism is logical replication from 12 to 17. We stand up a parallel 17 cluster, replicate, verify, cut DNS, demote 12 to read-only, keep it around for a fortnight. The risk lives in the seam: rows written to 12 after the publication slot was created but before the cutover finished. For an order table you re-derive. For an audit_log table that legally has to be append-only and complete, you do not get to re-derive.

The survival score for a single tenant depends on four things. One: does every table that needs replication have a primary key or a REPLICA IDENTITY FULL? Two: does the audit_log table use a server-side now() default rather than a client clock? Three: are there any unlogged tables in the critical path? Four: is the application willing to accept a 90-second read-only window during cutover?

On the last fourteen SMEs we scored, the answers split like this. Eleven had at least one critical table with no primary key, usually a join table that someone added in 2017 and forgot. Eight had audit_log timestamps coming from the application server, which means clock skew between two app boxes can put rows out of order across the cutover. Three were clean enough that we could promise zero audit-log row loss on a one-evening cutover. Those three are the ones we quoted aggressively.

What the three survivors had in common

The pattern in the three clean tenants was not technical sophistication. It was discipline that happened to pay off here. All three had a single backend engineer who, at some point between 2019 and 2022, had insisted on three things: every table gets a synthetic primary key, every timestamp is generated in the database, and the audit_log table is partitioned monthly with a retention policy enforced in code.

None of the three engineers had been thinking about logical replication when they made those choices. They were thinking about debuggability, reproducibility, and a GDPR right-to-be-forgotten request that landed once and embarrassed them. The audit benefited from decisions that were made for other reasons. This is the boring truth about why some clusters are easy to retrofit and others are not: the alignment is almost never planned.

The eleven that failed the survival check were not careless. They were the natural result of seven or eight years of just ship it, run by teams of one or two people. The remediation we quote them is not glamorous: add primary keys to four tables, move the timestamp default into the database, add a triggerless BEFORE INSERT check for monotonicity on the audit table, partition the largest two tables. Usually a single sprint, sometimes two.

The 47-minute version of the checklist

For a prospect call where we have one hour and a read-only Postgres role, this is the order we run it in:

  1. Confirm version, extensions installed, and shared_preload_libraries.
  2. Run the autovacuum-lag query above on tables greater than 40 GB. Note dead_pct and hours_since_vacuum for each.
  3. Check age(relfrozenxid) against autovacuum_freeze_max_age for those same tables.
  4. Pull the top 30 from pg_stat_statements. Flag any single query above 15% of total exec time.
  5. For each table the proposed agent will write to, check for primary key or REPLICA IDENTITY FULL.
  6. For the audit_log table specifically, check that the timestamp column has a server-side default and that the table has no triggers that could re-order rows under replication.
  7. Eyeball replication slot count, WAL size, and any existing publications.

Five of those seven items can fail in ways that block the agent project. None of them require us to know what the agent is going to do yet. That is the point: the cluster is either ready for new traffic or it is not, and the answer is independent of the LLM you are planning to wire in.

What we do with the score

The scored audit goes into the quote as a two-page appendix. If we found red items, those become a fixed-price preparation phase that runs before the agent work, and the agent quote is conditional on its completion. If we found yellow items, we note them as risks with a price band. If everything is green, we quote the agent work flat and we move fast.

When we built the customer-service agent for a Brabant-based wholesaler last quarter, the audit caught a 72 GB shipment_events table sitting at 26% dead tuples with no autovacuum for nine days. We spent the first week tuning autovacuum and partitioning the table by month before we wrote a single agent prompt. The agent shipped on schedule because the database was ready when it arrived, which is how most of our AI agent projects start: not with a model decision, but with a SELECT against pg_stat_user_tables.

If you do one thing today, run the autovacuum-lag query above against your own production replica and write down the worst dead_pct number you see. That single number, on a sticky note above your monitor, tells you more about whether your database is ready for an AI agent than any vendor brochure will.

Key takeaway

Run the autovacuum-lag query on your largest tables before you scope any AI agent work. The worst dead_pct number on your cluster is the real readiness score.

FAQ

Why 40 GB as the autovacuum threshold?

Below 40 GB the default scale factor of 0.2 produces vacuum cycles that finish inside a maintenance window. Above it, the same setting waits for 8 GB of dead tuples before running, which collides with business hours.

Can we just stay on Postgres 12 with extended support?

You can, but we will not quote a new AI agent on top of it. Security patches behind a paid contract are fine for legacy maintenance and bad for new build work the client expects to grow.

What if pg_stat_statements is not installed?

We install it during the audit window and come back in 72 hours. Without three days of representative data, the top-30 list is noise and any quote based on it is a guess.

How long does the full audit take?

About 47 minutes on a healthy cluster with a working read-replica and a Postgres role that can read pg_stat_user_tables and pg_stat_statements. Closer to a full day if extensions are missing or stats are stale.

ai agentsmigrationarchitectureoperationsstrategysecurity

Building something?

Start a project