Databases

Postgres audit for AI agent event tables: the checklist

It is Sunday night. Your agent_events table is 41 million rows. You need to delete 80% of it. The autovacuum worker has other plans for your evening.

Jacob Molkenboer· Founder · A Brand New Company· 3 Oct 2025· 7 min

Open wooden index-card drawer on ivory desk, brass divider with chartreuse tag, ledger paper stack, red wax seal.

It is Sunday night. The agent_events table is 41 million rows. The product team wants to cut 80% of it before the European workday starts, because the chat agent has been logging every tool call since March and the disk is now louder than the office fridge. You open psql. You type DELETE FROM agent_events WHERE created_at < now() - interval '30 days'. Then you remember the last time you did that.

The autovacuum worker had other plans for your evening.

This is the checklist we run on every Postgres event table that backs a long-running AI agent at ABN. Fourteen agents are live in production at the time of writing, and twelve of them write to a high-churn event log: tool calls, model responses, retry attempts, user feedback, the lot. Every one of those tables eventually outgrows its retention policy. The question is whether you find out on a Tuesday morning during planning, or at 23:47 on a Sunday when the disk-full alert wakes you up.

The HN thread that should be on the wall

A claim we keep open in a tab: the only scalable delete in Postgres is DROP TABLE. It overstates the point on purpose, the way good slogans do. The mechanic underneath is real. A row-by-row DELETE writes a dead tuple for every row, which means autovacuum has to walk the whole table to reclaim space, which means the table grows on disk even while you are trying to shrink it. Multiply that by 41 million rows in a single Sunday-night transaction and you have built yourself a long evening.

The way out is structural. Partition the table so each month of events is a child relation. Drop the child. The deletion is one DDL statement and a near-instant metadata change. Autovacuum stays asleep. Your disk graph drops in a clean step instead of a long swamp. The official partitioning docs are short and worth re-reading once a year.

So the first axis of the checklist is simple: how close is this table to being partitioned-by-range, and how far is it from being dropped one partition at a time?

The five-item audit

Run this against every event table you operate. Score each item 0, 1 or 2. A table at 8 or better can survive a 41M-row purge on a Sunday night without locking up autovacuum. A table at 5 or below is what woke you up.

1. Partition-by-range readiness

Two things matter. The table needs a monotonically increasing time column (created_at, event_ts, anything append-only). It needs no foreign keys pointing in from outside the partition tree, or if it has them, the FK has to include the partition key. Score 2 if the table is already partitioned by month. Score 1 if it has a clean time column and no awkward inbound FKs. Score 0 if half the schema references it by id.

-- quick check: is this table already a partitioned parent?
SELECT relname, relkind
FROM pg_class
WHERE relname = 'agent_events';
-- relkind 'p' = partitioned table, 'r' = plain table

2. Retention-policy lag in pg_stat_progress_vacuum

Look at pg_stat_progress_vacuum during a real workload, not in dev. If you see your event table in there for more than a few minutes at a stretch, your retention policy is behind reality. Score 2 if autovacuum never appears for this table outside of a freeze. Score 1 if it appears but finishes within an hour. Score 0 if the table is permanently in the scanning-heap phase.

SELECT
  p.pid,
  c.relname,
  p.phase,
  p.heap_blks_scanned,
  p.heap_blks_total,
  round(100.0 * p.heap_blks_scanned / nullif(p.heap_blks_total, 0), 1) AS pct
FROM pg_stat_progress_vacuum p
JOIN pg_class c ON c.oid = p.relid
ORDER BY pct DESC;

The progress reporting docs explain each phase. The one to watch for an event table is vacuuming indexes: if the table carries six btree indexes and you just deleted three million rows, autovacuum will spend most of its life there, not on the heap.

3. Index footprint

Every index is a tax on every write and every delete. On an event table that exists to be appended-to and then dropped, three indexes is generous. Five is suspicious. Eight means somebody added one per dashboard query and never came back. Score 2 if the table has at most three indexes and one of them is on the partition key. Score 1 if it has four to six. Score 0 if it has more than six, or if any sit on JSONB columns the dashboards no longer query.

SELECT
  s.indexrelname AS index_name,
  pg_size_pretty(pg_relation_size(s.indexrelid)) AS size,
  s.idx_scan AS scans_since_reset
FROM pg_stat_user_indexes s
WHERE s.relname = 'agent_events'
ORDER BY pg_relation_size(s.indexrelid) DESC;

The scans_since_reset column is the one that ends arguments. If an index has zero scans across a month of production traffic, it is not earning its keep on writes.

4. Bulk-delete survivability

This is the question the HN thread sharpens. Can this table be reduced by 80% in one operation without the database deciding to rewrite itself for the next eight hours? Score 2 if you can DROP TABLE an old partition. Score 1 if you have to DELETE in batches but a chunked job with LIMIT 50000 finishes under an hour. Score 0 if the only path is a single DELETE statement and a prayer.

Warning

Batched deletes still write dead tuples. They are kinder to replication lag than one giant transaction, but they do not save you from autovacuum. If a table scores 1 here, the real fix is to make it score 2 by next quarter.

5. Observability into the agent's write pattern

You cannot retention-trim what you cannot count. Score 2 if you know, off the top of your head, how many rows per minute the agent writes and what the median row size is. Score 1 if you can look it up in Grafana. Score 0 if you have to ask the agent's owner and they have to ask the model.

This one sounds soft. It is the most predictive of the five. Every event table we have ever had to rescue scored zero here first.

Three tables that survive a Sunday-night purge

From our own portfolio, the tables that pass the audit at 8 or better share three traits. They are worth naming because they are also the cheapest tables to operate.

The first is the canonical agent_events partitioned by created_at into monthly children, with one composite index on (agent_id, created_at). Retention is a cron that runs DROP TABLE agent_events_yYYYY_mMM for any month older than the policy. The purge is a metadata change. Autovacuum never sees it.

The second is tool_call_log. Same partition shape, plus a BRIN index on created_at instead of a btree. BRIN on a strictly append-only timestamp column is a few kilobytes per million rows and effectively free to maintain. The BRIN intro is worth reading once if you have not, because the pattern fits almost every event-log table cleanly.

The third is model_response_cache. This one is not partitioned, but it scores 8 because it is small (under a million rows), it has exactly one index, and its retention rule is trim-to-last-90-days run as a chunked batch every night during low traffic. Not every event table needs partitions. Some need fewer columns and an honest cron.

What the audit changes in practice

The point of scoring is not to grade. It is to make the next decision obvious. A table at 4 does not need a heroic Sunday-night intervention. It needs one quarter of work to make it partitionable, after which the Sunday-night intervention becomes a one-line cron.

We run this audit when we take over a database from a previous vendor, when an agent is about to ship to a tenth or hundredth of its current scale, and every six months on tables we already operate. It takes about an hour per database. It saves roughly one weekend per table per year, which is the only ROI calculation we have ever found persuasive at 02:00.

What we ran into at a client

When we built the chat agent for a logistics operator in Rotterdam, the thing we hit in month four was exactly the autovacuum-stuck-on-indexes pattern in item 2. The event table had grown to 28 million rows with seven indexes, three of which the dashboards no longer used. We dropped the unused indexes, partitioned by month going forward, and migrated history into the new shape over a weekend. The purge that had been blocked for three weeks ran in 200 milliseconds the following Sunday. That is the kind of work our AI agents practice does once the agent is live and the data layer becomes the bottleneck.

Pick one event table today. Run the five-item check. If anything scores zero, put a half-day on next week's calendar. That is the whole audit.

Key takeaway

If your event table cannot be reduced by 80% with one DROP TABLE statement, you are storing it wrong, not deleting it wrong.

FAQ

How often should I run this audit?

Every six months on tables you already operate, and immediately when you inherit a database from a previous vendor or platform. New tables get audited before they ship to production.

Is BRIN always better than btree for event tables?

For strictly append-only timestamp columns, yes. If you also need exact-match lookups on agent_id or user_id, keep a btree on those. BRIN replaces the timestamp btree, not every index.

Can I partition a table that already has 41M rows?

Yes, but the migration is heavier than partitioning from day one. Use pg_partman or do it in a maintenance window with a swap-table approach and a batched backfill.

Why not just use a TTL feature like in some NoSQL stores?

Postgres does not have a built-in TTL. Partitions plus a drop-old-children cron is the equivalent, and it gives you the metadata-only delete that survives heavy load.

ai agentsarchitectureoperationstoolingworkflow

Building something?

Start a project