Databases

Postgres 16 planner regression: an 11-hour RAG outage

Tuesday, 22:47 Amsterdam. PagerDuty fires for a Dutch insurer. Their claims-triage agent's RAG retrieval is 50x slower than it was an hour ago.

Jacob Molkenboer· Founder · A Brand New Company· 9 Jun 2026· 10 min

Open oak index-card drawer on ivory desk, one card tagged green, brass divider tilted, red wax seal fragment beside it.

Tuesday, 22:47 Amsterdam. PagerDuty fires for a Dutch insurer's claims-triage agent. The chat sits behind a RAG retrieval layer over twelve years of case files and policy PDFs. P95 retrieval latency, normally 80ms, is now 4.2 seconds. The customer-facing agent times out at 8 seconds, so every fifth conversation dies mid-sentence. The Eindhoven call centre has started manually overflowing into voicemail. We have until the morning shift at 09:00 to get this back under one second.

The symptom did not match any pattern in the runbook

Latency had spiked at 21:33. No deploy. No traffic surge, Tuesday evening is the quiet window. No GPU issues, embedding generation still ran in 11ms on the inference box. No load balancer drama. The only thing the dashboard agreed on was that the read path through Postgres was slow.

We started with the obvious. Was something locking the table? pg_stat_activity showed no blocked queries, no long transactions, no autovacuum stuck on the embedding table. CPU was elevated but not pinned. Disk IO had tripled. A vacuum that finished six hours earlier showed nothing weird in pg_stat_user_tables.

The agent's retrieval is one query. It looks roughly like this:

SELECT chunk_id, content
FROM rag_chunks
WHERE tenant_id = $1
  AND embedding <=> $2 < 0.3
ORDER BY embedding <=> $2
LIMIT 12;

tenant_id has a btree index. The embedding column has an HNSW index from pgvector 0.7.4. The query had been quiet at 80ms for six months. pg_stat_statements pointed at this query, and only this query, as the cause of the load.

The planner had changed its mind

EXPLAIN (ANALYZE, BUFFERS) showed it inside thirty seconds. The query was no longer using the HNSW index. It was doing a sequential scan over rag_chunks, then sorting by distance. For a fourteen million row table, on a tenant with 280,000 rows, that meant 4.2 seconds of pure CPU per call, and the connection pool was saturating.

Same query. Same data shape. Same tenant. New plan.

The planner's cost estimate for the HNSW index access path had moved. Not by much. Just enough to flip it past the sequential scan estimate. Once the planner decided seq scan was cheaper, every retrieval for every tenant landed on the same losing plan. The estimated row count for the index path had gone from 12 (sane, matches LIMIT) to 41,000 (clearly wrong for an HNSW operator), which pushed the cost from 0.42 to 9,847.

The fragile combination is HNSW plus ORDER BY plus a tight LIMIT. The HNSW operator's cost function returns a heuristic, not a measured row estimate, and when the planner's confidence in that heuristic drops by a fraction, the sequential scan starts looking like the safer bet. For a small enough LIMIT, the planner is almost right: a seq scan with an early stop could be cheap. With ORDER BY embedding, the early stop never fires, because the planner has to materialise every candidate row before it can rank them. The cost model knows this in theory. The constants moved enough to lose the bet in practice.

For background on reading a Postgres plan, the official EXPLAIN docs are the canonical reference. We had read them many times. We still spent twenty minutes staring at the wrong column before someone noticed the BUFFERS line and asked why the seq scan was reading 11GB.

The apt log told the story

At 21:31, exactly two minutes before the latency spike, the database host had run unattended-upgrades. Postgres 16 had stepped from one minor release to the next. pgvector stayed put.

$ grep postgresql /var/log/apt/history.log
Start-Date: 2026-06-08  21:31:04
Upgrade: postgresql-16:amd64 (16.4-1.pgdg22.04+1, 16.5-1.pgdg22.04+1),
         postgresql-client-16:amd64 (16.4-1.pgdg22.04+1, 16.5-1.pgdg22.04+1)
End-Date: 2026-06-08  21:32:11

The minor version notes included an adjustment to cost estimation for index scans with ORDER BY plus a tight LIMIT. The change was correct for most workloads. For the pgvector operator family, where the cost function had been quietly leaning on the older behaviour, it pushed the estimated cost above the sequential scan path. The HNSW index was still there. The planner had stopped picking it.

Warning

Never let unattended-upgrades touch your database host without a pinned minor version. A planner change one decimal place inside a routine security release can rewrite your production query plan without warning.

Three options at 23:30

By midnight we had three viable paths and a deadline that ended at 09:00.

Roll back to the previous minor. The previous .deb was still on the apt mirror. Apt downgrade on a running cluster is not graceful. The package was marked held-forward by the upgrader. Best case, 90 minutes plus a restart, with a real risk of a partial state if something went wrong at 01:00.

Brute-force the planner with GUCs. Setting enable_seqscan = off at the session level would force every other plan in the database to also avoid seq scans. The nightly reporting endpoints would not survive that. Acceptable as a tourniquet, ugly as a fix.

Hint the one query with pg_hint_plan. The pg_hint_plan extension lets you attach a comment to a query that tells the planner exactly which scan, join, and parallelism strategy to use. It has been the load-bearing workaround for production Postgres teams hitting planner regressions since 9.1. It is also the closest thing Postgres has to Oracle-style hints, which the community has historically refused to ship in core.

We picked the third option. Smallest blast radius, cleanest to remove later, and the runbook already contained two pg_hint_plan examples from a 2025 incident on a different client's billing query.

The 3am patch

We installed pg_hint_plan from the PGDG repository, added it to shared_preload_libraries, and reloaded the cluster. The Python retrieval call in the agent service was updated to send a hinted version of the same SQL:

/*+ IndexScan(rag_chunks rag_chunks_embedding_hnsw_idx) */
SELECT chunk_id, content
FROM rag_chunks
WHERE tenant_id = $1
  AND embedding <=> $2 < 0.3
ORDER BY embedding <=> $2
LIMIT 12;

The hint is the comment block. pg_hint_plan parses it before the planner runs and pins the scan method for the named table to the named index. Everything else is left to Postgres.

We deployed to one read replica first, ran 200 synthetic retrievals against it, and saw p95 settle at 84ms, four milliseconds above the long-term baseline. We promoted the change to the primary at 03:11. By 03:24 the dashboards were green and the call centre stopped seeing time-outs. Two trailing read replicas needed the same patch the next morning, which is why the incident timer ran to 11 hours and 9 minutes from first alert to all-clear across the fleet.

What we changed the next day

The post-incident review was short. Most of the work was prevention.

The Postgres minor version is now pinned on every database host in the fleet. The unattended-upgrades allowlist explicitly excludes postgresql-* and pgvector. Security patches for the database run on a deliberate Friday morning window, with a synthetic retrieval probe watching p95 in real time and a one-command rollback ready.

The synthetic probe itself is new. It runs every 60 seconds against a representative tenant, fires the same retrieval the agent fires, and pages on p95 above 500ms for three consecutive minutes. The probe would have caught the regression two minutes after it landed, instead of an hour and fourteen minutes later when the call centre noticed. We also added a daily ANALYZE on the embedding table, because the stale statistics were a quiet contributor to the bad row estimate.

The probe is fifteen lines of Python, on a one-minute cron:

import os, time, psycopg
from monitoring import push_metric, pagerduty

QUERY = open('rag_retrieval.sql').read()
EMBEDDING = load_canonical_embedding()  # cached, regenerated weekly
TENANT = os.environ['PROBE_TENANT_ID']

t0 = time.monotonic()
with psycopg.connect(os.environ['DSN']) as conn:
    conn.execute(QUERY, (TENANT, EMBEDDING)).fetchall()
elapsed_ms = (time.monotonic() - t0) * 1000

push_metric('rag.probe.latency_ms', elapsed_ms)
if elapsed_ms > 500:
    pagerduty.trigger('rag-probe-slow', value=elapsed_ms)

We tested the rollback path next, before we needed it again. The team ran the apt downgrade on a copy of the production cluster, watched the planner switch back to the HNSW path inside one minute of restart, and timed the full restore at seven minutes. The runbook now lists the exact apt command, the exact restart order, and the exact probe value that signals the rollback worked.

pg_hint_plan is now pre-installed on every Postgres host the studio touches. The next time we need a hint at 03:00, the extension will already be loaded and ready in shared_preload_libraries. The runbook explicitly says: if a retrieval latency alert fires and the apt log shows a Postgres or pgvector change in the last hour, run EXPLAIN before doing anything else.

The hints conversation is finally moving in core

Postgres has refused query hints in core for two decades on principle. The official wiki page on the topic argues, with some merit, that hints turn a planner problem into a code-maintenance problem, and that the right fix is better statistics or a better schema. That is true when you have time. It is not a 03:00 answer when a Dutch insurer is shipping voicemails to people calling about a flooded basement.

That argument is finally moving. A Postgres 19 hints proposal sat on the Hacker News front page this week, which is a useful signal that the patience for "improve your statistics" has worn thin outside the hackers list as well. The pg_hint_plan maintainers have been carrying this water since 9.1. Folding hints into core would let teams stop choosing between a shared-preload extension and a brute-force GUC, and would give incident reviewers a sanctioned escape hatch instead of an off-ramp through a third-party repository.

If Postgres 19 ships first-class hints, our runbook gets one line shorter. Until then, pg_hint_plan stays in production, and the hint comment in our retrieval layer stays where it is.

The smallest thing you can ship before the next apt log surprise

When we built that insurer's claims retrieval layer, the lesson we wrote into the postmortem was less about the specific regression and more about the failure mode. A working query that quietly stops being indexed is one of the hardest things to monitor for, because nothing breaks. It just gets slow. If you run a Postgres-backed RAG system, write the synthetic latency probe today, point it at one representative tenant, and page on p95 above 500ms. Five minutes of work, one query, one alert. That is the cheapest piece of insurance against the next minor release deciding your HNSW index is no longer worth using.

Key takeaway

A RAG query that quietly stops using its HNSW index is the failure mode you cannot see. Ship a synthetic latency probe before the next minor release.

FAQ

What is pg_hint_plan and is it safe to run in production?

It is a Postgres extension that lets you pin scan, join, and parallelism methods per query via SQL comments. It has been used in production since Postgres 9.1 and ships in the PGDG repos. It is safe, but the hints are debt: review them on every major upgrade.

Why would a minor Postgres upgrade change a query plan?

Minor releases occasionally adjust cost estimation, statistics handling, or operator class internals. Most queries are unaffected. Edge cases like pgvector HNSW scans with ORDER BY and a tight LIMIT can flip to a worse plan when the cost balance moves by a small amount.

How do you prevent unattended-upgrades from breaking the database?

Pin the Postgres and pgvector packages explicitly, add them to the unattended-upgrades blocklist, and patch the database tier during a planned window with a synthetic latency probe and a tested rollback path ready before you start.

Could a synthetic probe have caught this earlier?

Yes. A 60-second probe firing the real retrieval query and paging on p95 above 500ms would have alerted within two minutes of the apt upgrade, instead of more than an hour later when the call centre noticed the time-outs.

ragai agentsarchitecturecase studyoperations

Building something?

Start a project