Databases

Postgres event tables: the partition-swap pattern we ship

A 41M-row nightly DELETE locked a municipal-data consultancy's Postgres for nine hours. Here is why we ship monthly partitions on every agent event table.

Jacob Molkenboer· Founder · A Brand New Company· 10 Sept 2025· 9 min

Open oak index-card drawer with cream cards, brass divider, green ribbon marker, and red wax seal on ivory paper.

The Slack message landed at 06:47 on Wednesday. The infra lead at a 28-person municipal-data consultancy in Zwolle had been pinged by Pingdom: their main API was timing out, the gemeente dashboards were blank, and Postgres CPU had been pegged at 100% since just after midnight. The CEO was on a 09:00 call with a city customer. The standby replica was lagging by four hours. Nobody had touched the database in production for six weeks. By Friday afternoon the events table was partitioned by month and the retention cron was a one-line DROP.

The cron that broke everything was twelve characters of intent and a single semicolon:

DELETE FROM events
WHERE created_at < now() - interval '90 days';

That statement chewed through 41 million rows. It started at 02:00, ran headfirst into the autovacuum worker that fires at 02:30, and locked the table in a way no normal query could route around. By 06:47 it had finished the delete, but the table had hours of bloat to clear, the index was a graveyard, the WAL had blown past replication's ability to catch up, and every read query was waiting on a cleanup process that refused to release.

This is the story of why we stopped writing DELETEs against event tables at all.

The shape of the failure

There were three failures running at the same time, and only one of them was the obvious one.

The first was the cron itself. Forty-one million rows is not a query, it is a workload. Postgres uses MVCC, which means a DELETE does not free space, it marks rows as no-longer-visible. Forty-one million row versions had to be marked, the index entries had to be left in place for vacuum to clean up later, and every write to the table now had to step over the dead tuples. Disk usage actually grew during the delete because the WAL was logging every change.

The second was the autovacuum collision. Autovacuum had been about to do its scheduled scan of events at 02:30. Instead it found the table under a row-exclusive lock from the delete. Autovacuum waits politely. While it waits, the dead-tuple counter on the table keeps climbing. By the time the delete finished, autovacuum had hours of work backed up, and it ran in single-threaded chunks against a table that was being read by every API request.

The third was replication. The standby was streaming WAL faster than it could apply it. Forty-one million row deletes meant forty-one million WAL records, plus index entry cleanup, plus eventual VACUUM records. The standby fell behind, and the read replica the BI tool depended on stopped being useful.

The customer-facing damage was worse than the database log made it look. Every gemeente employee who opened a dashboard between 06:50 and 11:20 saw a spinner. The BI tool that ran council-meeting reports had been pointing at the lagging replica, so the numbers it returned were four hours stale without flagging it. Two cities filed support tickets. One of them later cited the incident in their renewal negotiation.

Warning

If your event table is more than 10M rows and you run scheduled DELETEs against it, you do not have a maintenance job. You have a latent outage waiting for a Wednesday.

Why this keeps happening

The pattern is everywhere because the syntax is harmless. DELETE FROM events WHERE created_at < ... looks like a tidy-up. Every team writes one. Most never hit the wall because their event table is small, or the cron runs at a quiet hour, or autovacuum happens to not collide. The Zwolle team had been running this query nightly for two years. It worked fine until it did not.

The thing that bites is that the cost of a DELETE on a large table is not proportional to the number of rows you delete. It is proportional to the work the database does to reconcile the deletion. That work happens later, asynchronously, against the same hot path that serves live traffic. A 41M delete is not "delete 41M rows." It is "delete 41M rows, mark 41M index entries as dead, generate roughly 6GB of WAL, ship that WAL to the standby, queue autovacuum work, hold a row-exclusive lock for an hour, then bloat the table by 30% until the next aggressive vacuum."

The conclusion we ended up at, scribbled on the whiteboard after the postmortem, was the same one every Postgres performance writeup eventually reaches. If you want to remove rows fast, without bloat, without autovacuum drama, without a hot-table lock, the only operation that does all of those things is a DROP. Everything else trades one cost for another.

The pattern we now ship

For every long-running agent we deploy, the event table is partitioned by month using native declarative partitioning. Deletion is never a DELETE. It is a DROP of the oldest partition.

The base table looks like this:

CREATE TABLE events (
    id          bigint GENERATED ALWAYS AS IDENTITY,
    agent_id    uuid          NOT NULL,
    kind        text          NOT NULL,
    payload     jsonb         NOT NULL,
    created_at  timestamptz   NOT NULL DEFAULT now(),
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

Two things to notice. The primary key includes created_at because Postgres requires the partition key to be part of any unique constraint on a partitioned table. And there is no foreign key from events to anything heavy; events are append-only and disposable by design.

Each month gets its own partition:

CREATE TABLE events_2026_06
    PARTITION OF events
    FOR VALUES FROM ('2026-06-01') TO ('2026-07-01');

CREATE INDEX events_2026_06_agent_idx
    ON events_2026_06 (agent_id, created_at DESC);

The agent writes to events and Postgres routes the row to the right child. Reads against events use partition pruning, so a query filtering on created_at >= now() - interval '24 hours' only touches the current and previous month's partitions.

Deleting ninety days of data is now this:

DROP TABLE events_2026_03;

That is the entire job. The operation takes an ACCESS EXCLUSIVE lock for the duration of the metadata change (milliseconds), drops the file on disk, releases the lock. No row scanning, no per-row WAL, no dead tuples, no autovacuum collision. Replication ships one DDL record.

Takeaway

If your retention rule is "older than N days," partition by time and make retention a metadata operation. The database stops being a participant in your cleanup logic.

Provisioning partitions ahead of time

The trap with manual partitioning is forgetting to create next month's partition. If June ends and events_2026_07 does not exist, the next insert errors out. We use pg_partman to provision and retire partitions on a schedule:

CREATE EXTENSION IF NOT EXISTS pg_partman;

SELECT partman.create_parent(
    p_parent_table  => 'public.events',
    p_control       => 'created_at',
    p_interval      => '1 month',
    p_premake       => 4
);

UPDATE partman.part_config
SET retention = '90 days',
    retention_keep_table = false
WHERE parent_table = 'public.events';

p_premake => 4 keeps four months of future partitions ready. retention_keep_table = false tells pg_partman to actually DROP partitions older than ninety days when the maintenance job runs, rather than just detach them. A nightly call to SELECT partman.run_maintenance() handles the rollover and the retention sweep in one pass. It finishes in well under a second on a table with hundreds of millions of rows because it never reads a row.

Migrating a live table

The Zwolle cleanup had to happen on a table that already held two years of data and was being written to every second. We did not pause writes. The migration ran in this order:

Create the new partitioned table as events_partitioned with the same columns and indexes, parented by RANGE.
Create partitions covering the historical range plus four months forward.
Copy historical rows in batches of 500k, oldest first, with INSERT INTO events_partitioned SELECT ... FROM events WHERE created_at BETWEEN ... AND .... Run this against the standby's WAL position so the live table keeps serving writes.
Once the copy reaches "yesterday," take a sixty-second write lock, copy the tail, swap the table names, point the application at the new table, release.
Drop the old table the next day, after a backup confirms.

ALTER TABLE events RENAME TO events_old;
ALTER TABLE events_partitioned RENAME TO events;

The application code did not change. The agent kept writing. The window where writes were blocked was under a minute. We ran it at 03:00 Amsterdam time on a Sunday.

What it cost, what it saved

Before partitioning, the events table was 78GB on disk, the nightly delete generated about 6GB of WAL, autovacuum ran for two to three hours after every delete, and replication lag spiked to forty minutes on a normal night and four hours on a bad one.

After partitioning, the table is the same size on disk minus the rolled-off month. Retention runs in 180 milliseconds. WAL traffic for the retention job is two DDL records. Autovacuum on the table only touches the current month's partition, which is small enough to finish in under thirty seconds. Replication lag during retention is unmeasurable.

One thing we did not plan for. After the swap, the query planner picked partition pruning plus a per-partition index scan on the hottest read endpoint, where the single-table layout had been doing a sequential scan with a date filter behind a too-broad statistics target. Median read latency on that endpoint dropped from 23ms to 6ms without us touching a query. The cost-based planner does its homework once it has the partition boundaries to work with.

The 06:47 incident has not recurred. Neither has any other table-wide lock event on that database.

The broader pattern

This is the same logic the Postgres documentation hints at when it warns about VACUUM cost on large tables. MVCC is a beautiful concurrency model and a brutal cleanup model. The trick is to never let cleanup happen at row granularity if you can help it. Partitions turn cleanup into a file-system operation. Indexes on the parent table become local to each child, so they stay small and fast. Backups can skip cold partitions. Read queries get partition pruning for free.

You do not need a 41M-row event table to benefit. A 5M-row table that grows linearly will hit the same wall in eighteen months. Partition early.

Where this pattern is overkill

Partitioning is not free. Each partition adds catalog overhead, query planning grows a few milliseconds slower as you cross the dozens-of-partitions mark, and the application has to be comfortable with DDL happening on a schedule underneath it. We do not partition every table. The rule of thumb: if the table is append-mostly, time-ordered, and retention is a function of age, partition it. If it is a slowly-mutating reference table, an aggregate snapshot you rebuild from scratch monthly, or a join target sitting at the wrong end of a foreign key, the overhead outweighs the win. An 80,000-row accounts table does not need this. The events tables, audit logs, message queues, and analytic firehoses that AI agents produce do, because they share the same shape: time-indexed, append-only, retention-bounded, and never the target of a join with FK constraints pointing in.

When we built the agent infrastructure for the Zwolle consultancy's tenant-archival service, the event table was the load-bearing wall we did not see until it cracked. Every agent in our AI agents stack now ships with monthly partitions and a pg_partman config on day one, because the database posture matters as much as the model layer.

What you can do today

Open psql against your largest write-heavy table. Run SELECT pg_size_pretty(pg_total_relation_size('your_table')) and SELECT n_dead_tup, n_live_tup FROM pg_stat_user_tables WHERE relname = 'your_table'. If the table is over 10GB or dead tuples are more than 20% of live tuples, your next retention cron is the one that will hurt. Sketch the partition boundary you would draw, and put a calendar entry for the migration before the table grows past the size where the swap window is comfortable.

Key takeaway

If your retention rule is time-based, partition the table and make cleanup a DROP. The only Postgres delete that scales without bloat or lock contention is DROP TABLE.

FAQ

Why is a large DELETE so much slower than expected in Postgres?

Because MVCC marks rows as dead rather than freeing space. Autovacuum cleans up later, which means a big DELETE generates bloat, WAL traffic, and lock contention long after the statement returns.

Do I need pg_partman to use partitioning?

No. Postgres has native declarative partitioning since version 10. pg_partman automates partition creation and retention, which saves operational work, but plain SQL and a cron also work.

Can I partition an existing live table without downtime?

Yes, with a swap migration. Build the partitioned copy, backfill in batches, take a brief write lock to catch the tail, rename. Most tables can switch over in under a minute of blocked writes.

What partition interval should I use?

Match it to your retention rule. If you keep ninety days, monthly partitions give you three live months plus one rolling drop. Weekly partitions fit tighter retention windows but multiply your partition count.

architectureoperationscase studyai agentstooling

Building something?

Start a project