Databases
Postgres deletes under AI workloads: 17 ways tables die
A delete-heavy embeddings table, an AI agent that quietly slows down, and one pg_stat_user_tables row that explains it. Seventeen failure modes, ranked.

It's 2am in Amsterdam. The on-call dashboard says your retrieval agent's p95 latency walked from 280ms to 4.1 seconds overnight. Recall on the offline eval set dropped four points. The Postgres box is at 18% CPU, so it isn't "the database is busy". You SSH in, run a single query against pg_stat_user_tables, and the agent_memories row stares back with 240 million dead tuples sitting on top of 38 million live ones.
This is the shape of the bug. An AI agent that does retrieval-augmented work is a delete-heavy system. You write embeddings, you TTL them out, you re-embed when the model changes, you dedupe near-identical chunks, you garbage-collect stale conversation context. Postgres handles deletes, but it handles them on its own terms, and a vector table makes those terms harsher than the documentation lets on.
Vector tables hit Postgres differently
Three things make a vector-search table different from a normal CRUD table under heavy deletes.
First, the vector column itself. A 1536-dimensional float array is roughly 6 KB. That's well past the 2 KB threshold where Postgres pushes the column out to a TOAST table. Every embedding row is really two rows in two tables, and every delete leaves dead tuples in both.
Second, the indexes. HOT updates, the cheap in-place trick that keeps normal tables from bloating, only fire when no indexed column changes. A vector index covers every row, so HOT is effectively off. Worse, pgvector's HNSW and IVFFlat indexes don't reclaim space the way B-tree does. HNSW marks deleted nodes as tombstones and keeps them in the graph until a full rebuild.
Third, the workload pattern. Agents tend to write in bursts (a re-embedding job, a nightly dedupe pass) and delete in bursts. Autovacuum is tuned for a steady drip. A million-row delete that completes in 90 seconds will sit on the table for hours before autovacuum gets to it on default settings.
Seven failures you can see in pg_stat_user_tables tonight
These are the cheap ones. One query, one screenshot, and you know whether you have a problem. The monitoring views are already on; you just have to read them.
SELECT
relname,
n_live_tup,
n_dead_tup,
round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
n_mod_since_analyze,
last_autovacuum,
autovacuum_count,
n_tup_hot_upd,
seq_scan,
idx_scan
FROM pg_stat_user_tables
WHERE schemaname = 'public'
ORDER BY n_dead_tup DESC
LIMIT 10;
The seven things to read out of those columns:
- Dead-tuple ratio above 20%. A vector table that's 20% dead is already losing planner accuracy. Above 50%, you're paying for an index that no longer fits in cache.
- last_autovacuum hours behind n_mod_since_analyze. Autovacuum is awake but losing the race. The default
autovacuum_vacuum_cost_limitof 200 was written for spinning disks. On NVMe it should be 2000 or higher. - autovacuum_count near zero on a hot table. The cost limit is being hit on every pass and the worker is bailing before it finishes.
pg_stat_progress_vacuumwill show one running for hours. - n_tup_hot_upd at zero. Confirms the vector index is blocking HOT. Not fixable without dropping the index, but useful to know: every UPDATE on this table is paying full bloat tax.
- seq_scan climbing on an indexed table. Planner cardinality is stale.
reltuplesinpg_classis now lying about row count and the planner thinks a sequential scan is cheaper than the HNSW lookup. - n_ins_since_vacuum tracking n_dead_tup. A classic insert-then-delete pattern. Often a dedupe job that writes then immediately removes near-duplicates. Worth fixing in the agent code before tuning the database.
- idx_scan flat while seq_scan rises. The vector index has been demoted by the planner. Recall is now whatever your
LIMIThappens to surface from a sequential scan, which is not what you advertised in your eval doc.
Six failures hiding one query deeper
These don't show in pg_stat_user_tables. You need pg_stat_activity, pgstattuple, or the WAL directory itself.
- TOAST bloat. The vector column is TOAST'd. Run
SELECT pg_size_pretty(pg_relation_size(reltoastrelid)) FROM pg_class WHERE relname = 'agent_memories';. If the TOAST table is bigger than the main heap, your bloat lives there, not where you were looking. - A stuck xmin horizon. One long-running transaction, an abandoned replication slot, or a forgotten prepared transaction will pin the xmin and stop VACUUM from removing any dead tuple newer than that snapshot.
SELECT pid, backend_xmin, state, query_start FROM pg_stat_activity WHERE backend_xmin IS NOT NULL ORDER BY backend_xmin LIMIT 5;finds the culprit. - WAL volume spiking. Every dead tuple is a WAL record. A delete-heavy workload can double your WAL throughput and stall replication. Check
pg_stat_waland the size of thepg_wal/directory. - Visibility map churn. Index-only scans depend on the visibility map. Heavy deletes flip the all-visible bit off on pages that get rewritten constantly, and the planner silently switches off the fast path.
- FSM mis-tracking. The free space map thinks pages have room they don't. New inserts go to extend the relation instead of reusing freed space, and the table grows on disk while the bloat percentage stays flat.
- Checkpoint storms.
pg_stat_bgwritershowscheckpoints_reqclimbing faster thancheckpoints_timed. Delete batches are pushing the WAL pastmax_wal_sizeand forcing checkpoints that compete with the agent's queries for I/O.
Four failures only DROP TABLE will fix
There is a community truism among Postgres operators: at a certain scale, the only DELETE that finishes in a reasonable amount of time is DROP TABLE. The framing is sharp on purpose, but for the four failures below it is also literally correct.
- HNSW graph degradation. Deleted vectors are tombstoned, not unlinked. The graph still walks through them on every search. After a few million tombstones, recall quietly drops and latency climbs.
REINDEX CONCURRENTLYon a billion-row table is a multi-hour operation that will starve the agent for the duration. - IVFFlat centroid drift. Your lists were computed against the data distribution you had six months ago. Deletes and re-embeds have moved the centroids. Quality decays gracefully, then falls off a cliff once a list goes empty. No VACUUM setting fixes a centroid.
- Heap bloat past 70%.
VACUUM FULLneeds an ACCESS EXCLUSIVE lock for the entire rewrite. On a hundred-gigabyte table that's hours of downtime.pg_repackavoids the lock but needs free disk space equal to the table size and will compete with the live workload the whole time. - Page fragmentation across TOAST'd rows. A 6 KB vector row leaves a 6 KB gap when deleted. The FSM cannot recombine those gaps into useful free space. Even with autovacuum running cleanly, the heap grows. Nothing short of a rewrite reclaims it.
The partition-swap pattern in concrete terms
If you reach the bottom four, the honest fix is not tuning. It is a maintenance window. The cheapest version of that window is a partition swap. Design for it from day one: a partitioned table you can drop a week of at a time will outlive a monolithic one you have to VACUUM FULL twice a year.
-- Original table, range-partitioned by week
CREATE TABLE agent_memories (
id uuid PRIMARY KEY,
tenant_id uuid NOT NULL,
embedding vector(1536) NOT NULL,
content text NOT NULL,
created_at timestamptz NOT NULL
) PARTITION BY RANGE (created_at);
CREATE TABLE agent_memories_2026w24
PARTITION OF agent_memories
FOR VALUES FROM ('2026-06-08') TO ('2026-06-15');
-- Sunday 03:00 maintenance
BEGIN;
ALTER TABLE agent_memories DETACH PARTITION agent_memories_2026w24;
COMMIT;
DROP TABLE agent_memories_2026w24;
The DETACH is fast and takes a brief lock. The DROP is instant. No VACUUM, no REINDEX, no waiting for autovacuum to catch up. You give back the disk, the bloat, the TOAST entries, and the HNSW graph fragment in one transaction.
The cost is that you must decide retention up front. A week of agent memory has to be enough, or two weeks, or whatever your product needs. Once you have committed to a window, the operational model gets dramatically simpler: every Sunday the oldest partition becomes a candidate for swap, and the swap is a script, not a project.
A diagnostic order of operations
If you're triaging a slow agent right now, run the queries in this order. Stop at the first one that explains what you're seeing.
- The
pg_stat_user_tablesquery above. Dead-tuple ratio and last_autovacuum. Two minutes. pg_stat_activityfor a stuck xmin. One minute.- TOAST size for the suspect table. One minute.
pgstattuplefor actual heap bloat, not just dead tuples. Five minutes, and it does read the whole table.- Index size and last
REINDEX. One minute. - If steps 1 through 5 are clean and the agent is still slow, the problem is centroid drift or HNSW tombstones, and you are looking at a rebuild or a swap.
Do not run VACUUM FULL on a vector-indexed table during business hours. It takes ACCESS EXCLUSIVE, blocks every read the agent depends on, and the operation is not interruptible without losing the work done so far.
The smallest useful thing you can do today
Run the pg_stat_user_tables query once. Save the output. Add a cron that runs it every six hours and alerts when the dead-tuple ratio on any agent table crosses 25%. You will spot the first eleven failures on this list weeks before they hurt the agent. The bottom four still need a Sunday night, but a partition-swap Sunday night is two hours of low-drama work, not a 6am incident.
When we built the retrieval layer for a SaaS client running eleven million embeddings across three tenants, we hit failure 14 at month three: HNSW tombstones quietly pushed recall down with no alert firing. We solved it by partitioning agent_memories by week and writing a 40-line swap script that runs every Sunday at 03:00 Amsterdam. The work behind that fix is what we mean when we talk about AI agents as a product surface rather than a demo.
Key takeaway
Vector-indexed Postgres tables with delete-heavy workloads need partition-swap retention from day one, not VACUUM tuning.
FAQ
How often should I check pg_stat_user_tables on a vector table?
Once every six hours is enough for most workloads. Alert when the dead-tuple ratio on any agent table crosses 25%. Daily is too slow; a delete-heavy job can blow past 50% bloat in an afternoon.
Is REINDEX CONCURRENTLY enough to recover an HNSW index?
Yes for the index itself, but it can take hours on a large table and competes with live queries for I/O. Above ten million rows, partition-swap is faster and more predictable.
Does pgvector's IVFFlat have the same tombstone problem as HNSW?
Slightly different. IVFFlat does not carry tombstones the same way, but list assignments drift as the underlying distribution changes. Both end up needing a rebuild eventually.
What is the minimum partition window that still makes the swap pattern work?
A day works if you have very high turnover. A week is the most common choice. Below a day, swap overhead and partition pruning costs start eating the win.