Databases

Postgres audit: what we check before an analytics agent

Before we quote an analytics agent, we ssh into a replica and run six queries. Twenty minutes of audit. Here is the exact checklist we use.

Jacob Molkenboer· Founder · A Brand New Company· 16 Aug 2024· 7 min

Open oak index-card drawer with brass divider, cream ledger stack, green tab card, red wax seal on ivory paper.

The deal sheet says "analytics agent: pulls metrics from your Postgres, posts a daily digest to the founder's Slack." Before we send the quote, we ssh into a read replica and run six queries. They take about twenty minutes. Most of the time they tell us the agent will work fine. Roughly one client in four, they tell us we'd be billing a feature on top of a database that has been quietly broken for two years.

This post is the checklist. If you are a founder or an ops lead at a sub-€5M SaaS and someone is about to wire an agent into your Postgres, you can run these yourself before you sign anything. None of it requires writing application code. All of it is read-only.

What "broken" looks like from the outside

The agent we ship for a typical analytics request does three things: it issues a handful of aggregate queries on a schedule, it materialises results into a small reporting table, and it posts a summary somewhere a human will read it. The queries themselves are usually unsurprising. Counts, sums, joins across two or three tables, a window function or two.

The agent is cheap. The database under it is the variable. We have seen the same SQL run in 80 ms on one client and 14 seconds on another, against tables of similar size. The 14-second case was not a query problem. It was a maintenance problem the team had inherited and never been told to look at.

So we audit. Six queries, one psql session, half an hour. The list below is what we actually run.

Table bloat, the first surprise

Postgres uses MVCC, which means an UPDATE does not overwrite a row. It writes a new version and marks the old one dead. Autovacuum is supposed to reclaim those dead rows. When it can't keep up, or was never tuned to, tables grow with ghost data. Sequential scans slow down. Index scans slow down. Backups balloon. Nobody notices, because the application still works.

Run this:

SELECT
  schemaname,
  relname,
  n_live_tup,
  n_dead_tup,
  round(n_dead_tup * 100.0 / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
  last_autovacuum
FROM pg_stat_user_tables
WHERE n_live_tup + n_dead_tup > 10000
ORDER BY dead_pct DESC NULLS LAST
LIMIT 20;

If the top of the list shows tables above 30% dead tuples, the agent is going to inherit that drag every time it scans them. We have seen a 9 GB events table where 6 GB was dead. The fix is not exotic. A VACUUM (VERBOSE, ANALYZE) on the worst offenders, then a conversation about why autovacuum stopped being aggressive enough. Which brings us to:

Autovacuum on settings nobody tuned

Postgres ships with autovacuum_vacuum_scale_factor set to 0.2. Translation: autovacuum kicks in when 20% of the rows in a table have changed. On a 50,000-row users table, that's 10,000 rows of churn, fine. On a 200 million-row events table, that's 40 million rows of churn before vacuum runs once. By then the index pages are a mess.

Check the cluster settings, then check which tables override them:

SHOW autovacuum_vacuum_scale_factor;
SHOW autovacuum_analyze_scale_factor;
SHOW autovacuum_naptime;
SHOW autovacuum_vacuum_cost_limit;

SELECT relname, reloptions
FROM pg_class
WHERE reloptions IS NOT NULL;

The first four show the cluster-wide defaults. The last shows which tables have per-table overrides. On three out of four databases we audit, that last query returns nothing, meaning every table is on the same default that was sensible in 2009.

The fix on a hot table is per-table:

ALTER TABLE events SET (
  autovacuum_vacuum_scale_factor = 0.02,
  autovacuum_analyze_scale_factor = 0.01
);

That's "vacuum when 2% of rows have changed", which on a 200M-row table means every 4M rows of churn instead of every 40M. The official Postgres documentation on routine vacuuming is the reference. It is long, but the per-table override section is the one to read first.

Warning

Do not run VACUUM FULL on a live primary because someone on Stack Overflow suggested it. It takes an exclusive lock and rewrites the entire table. We use pg_repack or off-hours maintenance windows for that. On the audit pass, we report and recommend. We do not repack.

The foreign keys that aren't there

This is the one that catches application teams off guard. They wrote the schema five years ago, the ORM enforced the relationship at the application layer, and at some point a migration added a column called organization_id without a foreign key constraint. The data is consistent enough. Until it isn't.

For an analytics agent, missing FKs matter for two reasons. First, the planner uses them. Postgres can elide joins, infer cardinality, and pick better plans when it knows two columns are linked. Second, dangling rows skew aggregates. If 0.3% of your invoices point to a deleted customer, your "revenue per customer" report quietly under-counts.

This query finds the suspects:

SELECT c.table_schema, c.table_name, c.column_name
FROM information_schema.columns c
WHERE c.column_name LIKE '%\_id' ESCAPE '\'
  AND c.table_schema NOT IN ('pg_catalog', 'information_schema')
  AND NOT EXISTS (
    SELECT 1
    FROM information_schema.key_column_usage kcu
    JOIN information_schema.table_constraints tc
      ON tc.constraint_name = kcu.constraint_name
     AND tc.table_schema    = kcu.table_schema
    WHERE tc.constraint_type = 'FOREIGN KEY'
      AND kcu.table_schema = c.table_schema
      AND kcu.table_name   = c.table_name
      AND kcu.column_name  = c.column_name
  )
ORDER BY c.table_schema, c.table_name;

It is a heuristic. Anything named *_id that isn't covered by an FK constraint. You will get false positives (a stripe_payment_intent_id, a legacy_external_id). Walk down the list with the engineer who knows the schema. The honest ones will be obvious within ten minutes.

Before adding the missing constraint, check for orphans:

SELECT COUNT(*)
FROM invoices i
LEFT JOIN customers c ON c.id = i.customer_id
WHERE i.customer_id IS NOT NULL
  AND c.id IS NULL;

If the count is zero, you can add the FK in a single transaction. If it isn't, you have a data-cleanup conversation to schedule, and that is a useful thing to discover before the analytics report goes live, not after a board meeting where the numbers don't tie out.

Stale planner statistics

The ANALYZE command updates the statistics the query planner uses. When stats are stale, Postgres builds plans for a table that hasn't existed in that shape for six months. The view pg_stat_user_tables exposes last_analyze and last_autoanalyze, and on busy tables with default autovacuum settings those timestamps can be weeks old.

SELECT
  schemaname, relname,
  last_analyze, last_autoanalyze,
  n_live_tup
FROM pg_stat_user_tables
WHERE n_live_tup > 100000
ORDER BY GREATEST(
  COALESCE(last_analyze,     '1970-01-01'::timestamp),
  COALESCE(last_autoanalyze, '1970-01-01'::timestamp)
) ASC
LIMIT 20;

The fix is usually a one-liner: ANALYZE schema.table;. The deeper question is why autovacuum is not running often enough, which is the same question as the autovacuum section above. See the monitoring stats documentation if you want to understand exactly what each column means.

Indexes that index nothing

Last query, lowest stakes, easiest to act on. Unused indexes are dead weight. They consume disk, they slow writes, and they make every VACUUM longer. On a typical audit we find between four and twenty of them, often left over from migrations someone ran in 2021 and never followed up on.

SELECT
  schemaname,
  relname     AS table_name,
  indexrelname AS index_name,
  idx_scan,
  pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE idx_scan < 50
  AND schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 20;

The idx_scan < 50 threshold is a soft floor. Adjust to taste. Anything that has been on the table for a year and scanned fewer than fifty times is a strong candidate to drop. Confirm with the team first. Sometimes an index exists because it backs a UNIQUE constraint, in which case the count is irrelevant.

Takeaway

If the database is healthy, the analytics agent is small. If the database is sick, the agent inherits every sickness, and the team blames the agent.

What we hand back before the quote

A two-page document. Page one: the six queries above, run against the client's database, with the actual numbers summarised. Page two: which findings are blockers ("we cannot run the agent reliably until X"), which are recommended fixes inside the engagement, and which are out of scope but worth knowing.

About a quarter of the time, the audit changes the shape of the deal. The analytics agent gets a smaller scope. A two-week database-hygiene engagement gets added in front of it. Or the client decides to do the hygiene work in-house first and come back in a month. Any of those outcomes beats discovering the same problems three weeks into a build.

When we wired the daily-digest agent for a logistics SaaS in Rotterdam earlier this year, the audit turned up 6 GB of bloat on their shipments table and three missing FKs on the table the digest needed to join against. We rolled the cleanup into the same statement of work as the AI agent build, and the digest ran in 1.2 seconds instead of the 11 we would have shipped without it.

The smallest thing you can do today: open psql against a replica, paste the bloat query at the top of this post, and read the top five rows. If anything is above 30%, you have a conversation to schedule before you wire anything new into it.

Key takeaway

If the database is healthy, an analytics agent is a small build. If the database is sick, the agent inherits every sickness and gets blamed for all of it.

FAQ

Can I run these audit queries on the primary, or do I need a replica?

All six are read-only and cheap. We prefer a replica out of habit, but pg_stat_user_tables and information_schema lookups don't block writes. The only thing that's not safe in business hours is VACUUM FULL.

How often should autovacuum run on a busy table?

Often enough that the dead tuple percentage stays under about 10%. On large hot tables that usually means a per-table scale factor between 0.01 and 0.05, not the cluster default of 0.2.

Do missing foreign keys actually slow queries down?

Yes, in two ways. The planner can't elide joins or infer cardinality without the constraint, and orphan rows skew aggregates. Both matter when an agent is running scheduled analytical queries.

What if the bloat query shows a table at 60% dead tuples?

Don't panic and don't run VACUUM FULL on a live primary. Schedule a maintenance window, run pg_repack, then tune the per-table autovacuum settings so it doesn't drift back. Investigate the workload causing the churn.

ai agentsarchitectureoperationstoolingsaasintegrations

Building something?

Start a project