
pgvector REINDEX broke our RAG: an incident walkthrough

A junior engineer ran REINDEX on a vector store at 14:21 on a Friday. By 15:07 the RAG agent was citing a clause of NEN-EN-IEC 60601-1 that does not exist.

Jacob Molkenboer · Founder · A Brand New Company · 19 Oct 2025 · 10 min


The page came in at 15:07 on a Friday afternoon in Eindhoven. A 31-person medtech SaaS we work with had a clinical customer on the phone, reading aloud from their RAG agent's answer. The agent had cited NEN-EN-IEC 60601-1, clause 11.6.7 as the source for a sterilisation re-validation interval. The clause does not exist. The compliance lead on the other end of the call had the standard open on her desk. Section 11.6 stops at 11.6.5.

We start there because the failure mode was not "the model lies." The model was retrieving the wrong chunks and confabulating around them. By the time we had the customer's CTO on the bridge, the same agent had served two more fictional sub-clauses to two other tenants. The common factor was a single git commit pushed at 14:18 by a junior engineer, with the message "reindex pgvector for the weekend warmup".

This is the walkthrough of what we found, and the four-step recall rig we now run before any retrieval layer ships to a clinical tenant.

What we actually saw in the logs

The agent itself was healthy on every dashboard. Every endpoint returned 200. Token usage was flat. The retriever was returning the requested top_k = 6 chunks per query, with cosine distances that looked normal at a glance. None of the obvious alarms fired.

What stood out was a single line in postgres.log from 14:21:

LOG:  duration: 42087.114 ms  statement: CREATE INDEX docs_embedding_hnsw_idx
    ON docs USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 16);

Their production rebuild normally takes about eighteen minutes on this dataset. Forty-two seconds was the first thing we asked them to explain.

HNSW parameters, briefly

pgvector's HNSW index has two build-time knobs: m, the maximum number of connections per node in each layer of the graph, and ef_construction, the size of the candidate list the build searches when wiring each new node into the graph. A higher ef_construction lets each node pick better neighbours, which gives a more accurate graph and a much slower build. The pgvector README spells out the trade-off and the project defaults (m = 16, ef_construction = 64).

This team had been deliberate. Their production index was built with ef_construction = 200 because their clinical content is dense, technical, and asked about in narrow ways. They had chosen recall over build time, on purpose, after running their own benchmarks. The rebuilt index had ef_construction = 16. A build-time candidate list more than ten times smaller, and so a far less carefully wired graph, in the same table, under the same name, serving every tenant.
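A mismatch like this is easy to detect once the live index definition is compared against the values the team expects. The sketch below is a minimal illustration, not this team's tooling. It assumes the definition string has already been fetched, for example from the indexdef column of Postgres's pg_indexes view. The names parse_hnsw_params, check_index and EXPECTED are hypothetical, and m = 16 for production is an assumption, since only ef_construction = 200 is stated above.

```python
import re

# Expected production build parameters. ef_construction = 200 comes from the
# article; m = 16 is an assumption made for this illustration.
EXPECTED = {"m": 16, "ef_construction": 200}

# pgvector's documented defaults, used when an option is absent from WITH (...).
PGVECTOR_DEFAULTS = {"m": 16, "ef_construction": 64}


def parse_hnsw_params(indexdef: str) -> dict:
    """Extract integer storage options from the WITH (...) clause of an index definition."""
    params = dict(PGVECTOR_DEFAULTS)
    match = re.search(r"WITH\s*\((.*?)\)", indexdef, flags=re.IGNORECASE | re.DOTALL)
    if match:
        # Accept both `m = 16` and a quoted form such as `m='16'`.
        for key, value in re.findall(r"(\w+)\s*=\s*'?(\d+)'?", match.group(1)):
            params[key.lower()] = int(value)
    return params


def check_index(indexdef: str, expected: dict = EXPECTED) -> list:
    """Return human-readable mismatches; an empty list means the index matches."""
    actual = parse_hnsw_params(indexdef)
    return [
        f"{name}: expected {want}, found {actual.get(name)}"
        for name, want in expected.items()
        if actual.get(name) != want
    ]


# The statement from the postgres.log line above.
rebuilt = (
    "CREATE INDEX docs_embedding_hnsw_idx ON docs USING hnsw "
    "(embedding vector_cosine_ops) WITH (m = 16, ef_construction = 16)"
)
print(check_index(rebuilt))
```

Run against the rebuilt definition, the check reports the ef_construction mismatch and nothing else. The regular expression is deliberately simple and only understands integer options, which is enough for m and ef_construction.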

The REINDEX that wasn't a REINDEX

The junior engineer's commit ran a single command from the repo root:

make reindex-vectors

The Makefile target was added six months earlier by an engineer who has since moved on. It reads:

reindex-vectors:
	psql $(DATABASE_URL) -f db/migrations/reindex_pgvector.sql

And reindex_pgvector.sql is not a REINDEX. A real REINDEX would have preserved the existing index definition. The script is a DROP INDEX followed by a CREATE INDEX, with parameters copied verbatim from db/seed/local_pgvector.sql. The seed file's author had set ef_construction = 16 so their docker-compose came up in under a minute on a laptop. Nobody had revisited it. The junior engineer reasonably trusted the target name.

Warning

If your "routine" reindex script is older than your retrieval team, read the SQL it actually runs before you let it touch production. The fast path is rarely the recall path, and the file name is not the contract.
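Reading the SQL can be partly automated. The following is a rough sketch of a lint that flags a "reindex" script which actually drops and recreates an index. The function name audit_reindex_script is hypothetical, the sample script is a reconstruction for illustration (the real file is not reproduced in this article), and splitting statements on semicolons is a simplification that would break on semicolons inside string literals.

```python
import re


def audit_reindex_script(sql: str) -> list:
    """Flag statements in a 'reindex' script that rebuild an index from a new definition."""
    findings = []
    # Strip `--` line comments so commented-out SQL is not flagged.
    body = re.sub(r"--[^\n]*", "", sql)
    for statement in (s.strip() for s in body.split(";")):
        if not statement:
            continue
        first_words = " ".join(statement.upper().split()[:3])
        if first_words.startswith("DROP INDEX"):
            findings.append("DROP INDEX found: the old definition is discarded")
        elif re.match(r"CREATE (UNIQUE )?INDEX", first_words):
            findings.append("CREATE INDEX found: parameters come from this script")
        elif not first_words.startswith("REINDEX"):
            findings.append(f"unexpected statement: {first_words}")
    return findings


# Illustrative reconstruction of a drop-and-recreate script, not the actual file.
sample_script = """
-- rebuild the vector index
DROP INDEX docs_embedding_hnsw_idx;
CREATE INDEX docs_embedding_hnsw_idx
    ON docs USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 16);
"""

for finding in audit_reindex_script(sample_script):
    print(finding)
```

A script that contains only REINDEX statements produces no findings. Anything else is a prompt to compare the script's WITH clause against the production index definition by hand.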

Silent collapse, no errors raised

Nothing crashed. Postgres did not warn. pgvector did not warn. The CI integration suite passed, because the smoke tests ask things like "what is the dosage of paracetamol for adults," and the answer chunks rank highly even in a sparse graph. The damage was concentrated in the long tail.

Clinical questions are precise. A query like "what is the maximum re-test interval for the dielectric strength test in clause 8.8.3" lands in a region of vector space that needs dense graph links to reach the right chunk. With ef_construction = 16, the graph routed those queries to adjacent chunks: insulation tests, electrical safety language, the right document and the wrong section. The retrieved context was plausible but did not contain the clause number the user asked about.

The model then did what models do under thin context. It produced a plausible-looking clause number from the same numbering scheme as the surrounding chunks. Production-grade models hallucinate less often than smaller ones, but they also hallucinate more convincingly. By the time anyone notices, your customer is on a call with a hospital reading the number out.

This is the failure mode 200-response health checks were built to miss. Latency was actually better, not worse: a sparser HNSW graph traverses faster. Throughput was identical. The retriever returned the requested six chunks every time, with cosine distances inside the usual band. The one metric that would have shouted was recall@k, and nothing in the stack was measuring it.

The four-step recall rig

After we stabilised the incident, rolled back to the previous index from a logical backup, rebuilt with the correct parameters using CREATE INDEX CONCURRENTLY, and told the affected tenants in writing what they had seen and why, we wrote down a recall rig. We now run it before a retrieval layer goes live for any clinical tenant, and on every change to an index, embedding model, or chunking strategy after that.

1. A frozen golden set, per tenant, written by a human

Two hundred real questions per tenant, each with the ground-truth document ID and the ground-truth chunk ID attached. We do not auto-generate these. A domain expert from the customer sits with one of our engineers for two afternoons and writes them. The set goes into the tenant's config, gets checksummed, and only the domain owner can change it. Auto-generated golden sets paper over exactly the kind of long-tail blind spot that broke this customer.
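As a sketch of what "checksummed" could look like in practice, the snippet below hashes a canonical encoding of the golden set so that any edit, however small, changes the digest. This is an assumption about one reasonable implementation, not a description of the actual tenant config. GoldenQuestion, gold_doc_id, golden_set_checksum and verify_golden_set are hypothetical names, and the document and chunk IDs are invented.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GoldenQuestion:
    text: str
    gold_doc_id: str
    gold_chunk_id: str


def golden_set_checksum(questions) -> str:
    """SHA-256 over a canonical JSON encoding, independent of question order."""
    rows = sorted(
        json.dumps(asdict(q), sort_keys=True, ensure_ascii=False) for q in questions
    )
    return hashlib.sha256("\n".join(rows).encode("utf-8")).hexdigest()


def verify_golden_set(questions, recorded_checksum: str) -> None:
    """Raise if the set no longer matches the checksum stored with the tenant config."""
    actual = golden_set_checksum(questions)
    if actual != recorded_checksum:
        raise ValueError(f"golden set changed: {actual} != {recorded_checksum}")


# Hypothetical entries; the IDs are made up for the example.
questions = [
    GoldenQuestion(
        text="what is the maximum re-test interval for the dielectric strength test in clause 8.8.3",
        gold_doc_id="doc-60601-1",
        gold_chunk_id="chunk-0412",
    ),
    GoldenQuestion(
        text="which insulation tests apply after sterilisation re-validation",
        gold_doc_id="doc-60601-1",
        gold_chunk_id="chunk-0398",
    ),
]

recorded = golden_set_checksum(questions)
verify_golden_set(questions, recorded)  # passes silently while the set is unchanged
```

Sorting the encoded rows makes the digest insensitive to question order, so reordering the file is not treated as a change, while editing a single gold chunk ID is.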

2. Recall@k and MRR, measured before and after every change

The numbers are blunt and we like that. The rig runs against the same golden set every time, and the diff is what matters.

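def measure_recall(golden_set, retriever, k=5):
    """Return recall@k and MRR (truncated at rank k) over a labelled golden set."""
    if not golden_set:
        raise ValueError("golden_set is empty; nothing to measure")
    hits = 0
    mrr_sum = 0.0
    for q in golden_set:
        # Ranked top-k results for this question, best match first.
        ranked = retriever.search(q.text, k=k)
        chunk_ids = [r.chunk_id for r in ranked]
        if q.gold_chunk_id in chunk_ids:
            hits += 1
            # Reciprocal rank of the gold chunk; a miss contributes 0.
            rank = chunk_ids.index(q.gold_chunk_id) + 1
            mrr_sum += 1.0 / rank
    return {
        "recall_at_k": hits / len(golden_set),
        "mrr": mrr_sum / len(golden_set),
    }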

We fail the deploy if recall@5 drops by more than one percentage point, or if MRR drops by more than 0.03, against the last green run. The build that triggered the incident would have failed this gate easily: on the clinical golden set, recall@5 dropped roughly fourteen points and MRR collapsed from 0.71 to 0.49.
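The gate itself is a small comparison between two result dictionaries of the shape measure_recall returns. Here is a minimal sketch, assuming the last green run is stored somewhere the CI job can read. The function name gate and the constant names are hypothetical. The MRR figures are the ones quoted above, while the recall values are illustrative numbers chosen only to be fourteen points apart.

```python
# Thresholds from the text: one percentage point of recall, 0.03 of MRR.
MAX_RECALL_DROP = 0.01
MAX_MRR_DROP = 0.03


def gate(baseline: dict, candidate: dict) -> list:
    """Compare a candidate run against the last green run; return reasons to fail."""
    failures = []
    # Round to avoid float noise right at the threshold.
    recall_drop = round(baseline["recall_at_k"] - candidate["recall_at_k"], 6)
    mrr_drop = round(baseline["mrr"] - candidate["mrr"], 6)
    if recall_drop > MAX_RECALL_DROP:
        failures.append(f"recall@k dropped by {recall_drop:.3f}")
    if mrr_drop > MAX_MRR_DROP:
        failures.append(f"MRR dropped by {mrr_drop:.3f}")
    return failures


# Recall values are illustrative; the MRR values are the ones quoted in the text.
last_green = {"recall_at_k": 0.90, "mrr": 0.71}
incident_build = {"recall_at_k": 0.76, "mrr": 0.49}

failures = gate(last_green, incident_build)
for reason in failures:
    print("FAIL:", reason)
```

Improvements pass the gate untouched, because a negative drop never exceeds either threshold. In CI, a non-empty list would map to a non-zero exit code.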

3. Synthetic distribution coverage

The golden set is the floor, not the ceiling. On top of it we generate a synthetic query set from the source documents themselves: questions phrased six different ways per chunk, including formal, conversational, abbreviated, in Dutch where the corpus has Dutch source material, with a deliberate typo, and as a clinician would actually phrase it on a Monday morning. The synthetic set catches regressions on phrasings the human author did not think of. It does not replace the golden set. It widens the net so a change that improves the average without improving the worst case is still visible.
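One way to keep the worst case visible is to report recall per phrasing style next to the overall average. The sketch below assumes each synthetic query has already been run and reduced to a (style, hit) pair. The function names and the sample outcomes are hypothetical, and generating the phrasings themselves is out of scope here.

```python
from collections import defaultdict


def recall_by_style(results):
    """Recall per phrasing style from (style, hit) pairs."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for style, hit in results:
        totals[style] += 1
        hits[style] += int(hit)
    return {style: hits[style] / totals[style] for style in totals}


def summarize(results):
    """Report the average alongside the weakest style so neither hides the other."""
    results = list(results)
    if not results:
        raise ValueError("no synthetic results to summarize")
    per_style = recall_by_style(results)
    worst_style = min(per_style, key=per_style.get)
    return {
        "overall_recall": sum(int(hit) for _, hit in results) / len(results),
        "worst_style": worst_style,
        "worst_recall": per_style[worst_style],
    }


# Illustrative outcomes only: True means the gold chunk was in the top k.
sample = [
    ("formal", True), ("formal", True), ("formal", True), ("formal", True),
    ("typo", True), ("typo", False), ("typo", False), ("typo", False),
]
print(summarize(sample))
```

In this made-up sample the overall recall looks moderate while the typo phrasing is failing three times out of four, which is the pattern an average alone would hide.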

4. Production drift alarm

A nightly job samples fifty real production queries (PII-scrubbed), replays them against the current index and against a frozen snapshot of last week's index, and computes the overlap of the top-three chunk IDs per query. If the average overlap drops below 70 percent without a corresponding intentional change in the deploy log, we get paged. Had the rig been in place, this is the step that would have flagged the rebuilt index on its first run after the change. The drift alarm is what closes the gap between "we test before deploy" and "someone could change the index out of band."

We keep last week's index in a separate Postgres schema, refreshed every Sunday during the maintenance window. The storage cost is one extra index plus its WAL footprint, which for this customer runs about 1.4 GB. Cheap insurance against the next out-of-band change, and a useful artefact to diff against when a tenant says "the answers feel different this week."
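The overlap computation behind the drift alarm can be sketched in a few lines. This version assumes overlap means the share of the top-three chunk IDs that both indexes return for a query; a Jaccard ratio would be an equally defensible choice. The function names are hypothetical, and the check against the deploy log is left out.

```python
def top_n_overlap(current_ids, snapshot_ids, n=3):
    """Fraction of the top-n chunk IDs shared by the current and snapshot results."""
    current_top = set(current_ids[:n])
    snapshot_top = set(snapshot_ids[:n])
    return len(current_top & snapshot_top) / n


def drift_alarm(pairs, threshold=0.70, n=3):
    """pairs: list of (current_ranked_ids, snapshot_ranked_ids), one per sampled query.

    Returns the mean overlap and whether it fell below the paging threshold.
    """
    if not pairs:
        raise ValueError("no sampled queries to compare")
    mean_overlap = sum(top_n_overlap(c, s, n) for c, s in pairs) / len(pairs)
    return mean_overlap, mean_overlap < threshold


# Two illustrative queries: one unchanged, one where only the top hit survived.
pairs = [
    (["c1", "c2", "c3"], ["c1", "c2", "c3"]),
    (["c1", "c8", "c9"], ["c1", "c2", "c3"]),
]
mean_overlap, should_page = drift_alarm(pairs)
print(f"mean top-3 overlap: {mean_overlap:.2f}, page: {should_page}")
```

The comparison ignores order within the top three, so a reshuffle of the same three chunks does not count as drift. Whether that is the right trade-off depends on how sensitive the downstream prompt is to chunk order.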

What we changed in their process, not just their code

The technical fixes were boring. Rename the Makefile target so it tells the truth. Gate it behind a confirmation prompt. Move the seed file's loose parameters into an environment-specific variable so production and local cannot share them by accident. Add the recall rig to CI. None of that is a story.

The process change is the one we keep coming back to with other customers. This team had a release checklist for the application, and a separate release checklist for the database. Neither of them mentioned the retrieval layer. The vector index sat in a dead zone between "infra" and "model," and nobody owned its quality bar. The on-call engineer for the database was not the on-call engineer for the agent.

We added a single owner, one dashboard with recall@5 and MRR per tenant, and a rule we have since copied into every other RAG engagement we run: any change that touches the retrieval layer ships behind the rig, or it does not ship. Embedding model upgrades. Chunk size changes. Index rebuilds. All of it. The rig is what catches the kinds of silent failure that 200-response health checks cannot.

The single owner is not a job title. It is a name in a wiki next to the dashboard, with the same person on-call for retrieval drift as for the agent that consumes it. That alone closed the gap we found ourselves in at 15:07 on a Friday: the database engineer who pushed the commit and the agent engineer who got the customer call were finally pointing at the same dashboard.

When we built the RAG agents for this customer's clinical product line, the thing we kept running into was exactly this silent-degradation pattern, dressed up as a routine database operation. We ended up solving it by putting recall measurement on the same release rail as code coverage: not optional, not advisory, gated.

The smallest thing you can do today, before you close the tab: open whatever script your team calls "reindex" and read the SQL it actually runs. If it does anything other than REINDEX, write the parameters down and check them against the index definition in production. Five minutes, and you find out whether you have the same hidden migration in your repo.

Key takeaway

Vector index health is a release-gate concern. If a routine REINDEX can silently cut your recall, your retrieval layer needs the same test discipline as your application code.

FAQ

Does a normal pgvector REINDEX change my HNSW parameters?

No. A real REINDEX preserves the index definition. If your parameters changed after a "reindex," the script you ran was almost certainly a DROP INDEX followed by a CREATE INDEX with a different WITH clause.

What is a reasonable ef_construction for production HNSW in pgvector?

The pgvector default is 64. Teams with dense, technical corpora and precise queries often raise it to 200 or higher, trading build time for recall. Always measure on your own golden set before locking it in.

How do we measure RAG recall without a domain expert to label data?

You can start with synthetic golden sets generated from your own chunks, but they understate long-tail failure. For anything regulated or safety-relevant, pay a domain expert for two afternoons of question authoring per tenant.

Why does a frontier model still hallucinate when retrieval misses?

Given semi-relevant context, the model completes the pattern of what surrounding chunks look like. If the chunks are sub-clauses of a standard, the hallucination is also a plausible sub-clause number. Better retrieval is the only fix.

How often should we re-run the recall rig in production?

Always before any retrieval-layer deploy. Nightly drift checks on a sampled production query set. Full golden-set replays at least weekly, and immediately after any database operation that touches a vector index.
