RAG

Vector store audit: what we check before a RAG retrofit

A founder forwards last month's vector database invoice and asks for a RAG retrofit quote. We do not send the quote that week. We run the audit first.

Jacob Molkenboer· Founder · A Brand New Company· 10 Nov 2025· 8 min

Open oak index-card drawer with cream cards, one green flag, brass divider, red wax seal on folded paper, ivory desktop.

A Utrecht-based SaaS founder forwarded us their last invoice from a managed vector database vendor on a Tuesday afternoon. Just over €4,200 a month, for a knowledge base their support agent queries about forty times a day. They wanted a quote to retrofit RAG properly: better retrieval, less hallucination, optionally a migration to something they could host themselves. We did not send the quote that week. We ran the audit first.

We do this for every sub-€30M Dutch SaaS that asks us to touch their vector store. The audit takes between two and four working days, depending on how much of the ingestion pipeline lives in cron scripts versus in the application code. It produces six numbers and one go/no-go recommendation. Below is what is on the checklist, why each line is there, and how to run it yourself if you want to skip hiring anyone.

The 90-day ingestion window

We do not trust anything older than ninety days as representative of current behaviour. By that age the codebase has shifted, the ingestion script has been rewritten at least once, somebody has added a new document type and forgotten to update the chunker, and a model version has moved underneath the embeddings. Anything before the ninety-day mark is archaeology. We log it, we do not score it.

The first thing we pull is the count of vectors created per week, grouped by source document type and by ingestion job. If the histogram has a flat patch in the middle of the window, something stopped. If it has a vertical wall on a Tuesday, someone re-ingested the whole corpus without telling anyone. Both are red flags, but they are interesting red flags. We want to know about them before we quote.

Chunk-overlap drift

The first column we score is whether your chunks look like they came from the same chunker. In a healthy index, chunk length distribution is tight: a clear central peak, low variance, predictable overlap. In a drifted index, the histogram is bimodal or worse. That is your hint that someone changed the splitter mid-stream, or that you have two ingestion paths writing into the same collection with different settings.

Here is the SQL we run against a pgvector store. It assumes a document_chunks table with a chunk_text column and a metadata JSONB column that holds the overlap as it was at ingest time. If you do not have that metadata, that is itself a finding: it means you cannot reconstruct what chunker produced each vector.

WITH windows AS (
  SELECT
    date_trunc('week', created_at) AS week,
    COUNT(*) AS chunks,
    AVG(LENGTH(chunk_text))::int AS avg_len,
    STDDEV(LENGTH(chunk_text))::int AS std_len,
    AVG((metadata->>'chunk_overlap')::int)::int AS avg_overlap
  FROM document_chunks
  WHERE created_at > NOW() - INTERVAL '90 days'
  GROUP BY 1
)
SELECT
  week,
  chunks,
  avg_len,
  std_len,
  avg_overlap,
  avg_len - LAG(avg_len) OVER (ORDER BY week) AS week_delta_len
FROM windows
ORDER BY week;

The drift score we report is the maximum week-over-week absolute delta in average chunk length, divided by the global average. Anything above 0.15 (a 15% jump in mean chunk size between two consecutive weeks) gets flagged. We have not yet seen an index above 0.30 that produced good retrieval without intervention.

Embedding-model version pinning

This is the silent killer we find most often. Someone wired the ingestion pipeline against text-embedding-ada-002 back in 2023, then upgraded to text-embedding-3-small in early 2024 for the new ingestion job, then never re-embedded the historical corpus. The result: roughly 70% of the vectors live in one embedding space and 30% live in another. Cosine similarity between them is statistically random. The agent retrieves the wrong chunks and nobody knows why.

The fix is to pin the embedding model and its version in metadata at write time, and to be able to query for the distribution. OpenAI documents explicitly that newer embedding models are not interoperable with older ones; the same rule applies to Cohere, Voyage, and any self-hosted model whose weights you might re-quantise. Pin it. Always.

SELECT
  metadata->>'embedding_model' AS model,
  metadata->>'embedding_model_version' AS version,
  COUNT(*) AS vectors,
  ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS pct
FROM document_chunks
GROUP BY 1, 2
ORDER BY vectors DESC;

If this query returns more than one row with a non-trivial percentage, the index needs a re-embedding pass before any retrieval improvement work is worth doing. We do not quote optimisation against a mixed-model index. It is the equivalent of tuning a carburettor on an engine that has two different fuels in two of the cylinders.

Warning

If your metadata does not include the embedding model name and version on every vector, treat the whole index as untrusted until you can prove otherwise. There is no honest way to retrieve across two embedding spaces, and no honest way to evaluate retrieval quality without knowing which space each vector lives in.

Per-tenant ACL filters on a multi-org index

Almost every Dutch SaaS we audit runs a multi-org index: one collection holding documents from every customer, with a tenant_id field used to scope queries. This is fine right up until it is not. The audit checks three things, in this order.

First, is tenant_id present on every single vector? We count nulls, then we count distinct tenant_ids against the application's tenant table. If a single vector has a null tenant_id, the system can return it to anyone. That is a breach waiting for its first audit.

Second, does the query plan actually use a filter index? In pgvector you want an HNSW index on the embedding column plus a B-tree on tenant_id, and the plan should apply the tenant filter as a pre-filter, not a post-filter. The execution plan should look like an index scan on tenant_id feeding a vector search, not a vector search feeding a tenant filter.

EXPLAIN (ANALYZE, BUFFERS)
SELECT id, content,
       embedding <=> $1::vector AS distance
FROM document_chunks
WHERE tenant_id = $2
ORDER BY distance
LIMIT 10;

Third, what happens when the application code forgets to pass tenant_id? We grep the codebase. If the answer is that the database returns every row across every tenant, we write that finding in red and we do not move on until it is fixed. The correct default is a function or a view that requires the tenant scope, or a row-level security policy that enforces it server-side. Application-layer-only ACLs are one missing parameter away from a data incident.

The pgvector to Qdrant survival test

The question the founder usually asks at the end of the first call: should we move off pgvector to something dedicated? Qdrant, Weaviate, Pinecone, pick the logo. Our answer depends entirely on what the audit shows, and we frame it as a survival test.

We pick three tenants from the customer cohort. The biggest by document count. The most active by query volume over the last thirty days. And one that was onboarded in the last sixty days, because new tenants exercise the current ingestion path most aggressively. We simulate the migration for these three tenants only, against a throwaway Qdrant collection, and we ask: do their ACL filters survive without changes to the data?

In Qdrant, per-tenant ACL becomes a payload filter on every search call. The vector data and the tenant_id move across cleanly if the tenant_id was in metadata at ingest. If it was baked into the collection name (one collection per tenant), the migration is also clean but the cost calculus shifts: Qdrant collections have a non-trivial memory footprint each, and a SaaS with three hundred tenants on a per-tenant collection model will burn more RAM than the pgvector setup it left behind.

from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

client = QdrantClient(url="http://localhost:6333")

hits = client.search(
    collection_name="chunks",
    query_vector=query_vec,
    query_filter=Filter(
        must=[
            FieldCondition(
                key="tenant_id",
                match=MatchValue(value=tenant_id),
            ),
        ],
    ),
    limit=10,
)

If all three test tenants survive the migration with their ACL filters intact and their retrieval quality measurably equal or better, the migration is mechanical engineering work. We quote it by the day. If even one of them fails (mismatched embeddings, missing tenant_id, ACL leak in the payload filter), we tell the founder to fix the ingestion path first and to reconsider the migration in a quarter. Migrating a broken index produces a faster broken index.

Local models and the budget question

Three of the last five audits ended with the same follow-up: can we run the embeddings locally instead of paying a vendor per token? The Hacker News front-page thread this week asking whether anyone has replaced GPT or Claude with a local model for daily coding has been hovering at the top for a reason. The hardware curve and the open-model curve are crossing for a specific kind of workload, and embedding generation is firmly on the side where local is now realistic for a mid-size SaaS.

The audit answers this concretely. We compute the monthly token volume across the corpus and the query stream, multiply by the current vendor's rate, and compare to the amortised cost of a single workstation running a sentence-transformers model on a consumer GPU. For two of the last five audits, local won by a factor of three over a two-year horizon. For the other three, the engineering time to run and monitor the local inference path was the dominant cost and the vendor remained the right call. There is no universal answer. There is the audit.

What the checklist costs you to run

The whole audit is six SQL queries, one Python script, and a half-day spent reading the ingestion code. You can do it yourself this week, if you have the access. The output is six numbers (drift score, model purity, tenant_id coverage, query plan correctness, three-tenant survival result, projected migration cost) and a one-line recommendation. Run it before you sign anything: before you renew the vector database contract, before you commit to a migration, before you let an AI consultancy quote a retrofit.

When we ran this exact audit for a Rotterdam logistics SaaS last month, the headline finding was that 41% of their vectors had been embedded with a model deprecated for nine months, and their per-tenant ACL was a Django view filter with no row-level policy behind it. We re-embedded the historical corpus over a weekend and shipped the row-level security policy on the Monday. The retrofit quote followed the audit, not the other way around: that is roughly how we run every AI agents and RAG engagement.

The smallest thing you can do today: open a SQL client, run the embedding-model distribution query above against your own vector store, and find out how many embedding spaces are mixed inside your index. If the answer is more than one, you have your first finding.

Key takeaway

Run the audit before you quote, migrate, or renew. Six SQL queries and a half-day of code reading beat six months of mystery-retrieval debugging.

FAQ

What is chunk-overlap drift and why does it matter?

It is variation in chunk size and overlap across your ingestion history. It matters because cosine similarity across vectors built with different chunkers is noisier than it looks, which degrades retrieval in ways that are hard to debug.

Should we keep pgvector or migrate to a dedicated vector database?

Depends on the audit. If your ingestion path is healthy and tenant_id metadata is clean, pgvector handles low millions of vectors fine. Migrate when query latency or ACL complexity outpaces what Postgres can do, not before.

How often should we re-run this audit?

Quarterly is the right cadence for an active product, or after any change to the ingestion pipeline. The ninety-day scoring window is designed to line up with that rhythm.

What if our metadata does not include the embedding model name?

That is finding number one. Add the embedding model and version to every new vector at write time, then plan a re-embedding pass for the historical corpus during a low-traffic window. Treat the existing index as untrusted until then.

ragknowledge basearchitecturemigrationoperationssaas

Building something?

Start a project