RAG drift: a Pinecone rebuild ate 38 indemnity clauses

On day nine, a paralegal flagged a missing indemnity clause. By that afternoon we had counted 38. The contract-review agent had been silently dropping work since the index rebuild, and nobody noticed.

Jacob Molkenboer · Founder · A Brand New Company · 16 May 2025 · 9 min

Tuesday, 09:14. A senior paralegal at a 26-person legaltech in Utrecht is reading the agent's review of a vendor MSA. She gets to section 11 and stops. The indemnity clause is not flagged. She knows it should be: she co-wrote the review template against this exact pattern back in February. She pulls yesterday's batch. Same gap on three other contracts. She opens a Slack thread with the engineering lead and writes one sentence: are we sure the agent is reading the whole document?

By 16:00 that day they had counted 38 missed indemnity clauses across nine days of production traffic. The agent had not crashed. It had not throttled. Nothing in their dashboards looked wrong. It had been confidently returning reviews that were missing the single most expensive paragraph in a commercial contract.

This is the post-mortem. We were brought in on day ten to find the cause and rebuild the retrieval layer. The bug was not in the model, not in the prompt, and not in the embeddings. It was in a single integer column that nobody had thought about for six months.

How the contract-review agent was wired

The product was a niche Dutch legaltech SaaS that did first-pass review on inbound contracts for mid-market companies. A user uploads a PDF, the system extracts text, splits it into clauses, and an agent walks a review checklist: indemnity caps, governing law, auto-renewal, data-processing addendum, the usual list. For each item on the checklist, the agent retrieves the most relevant chunks from the contract and asks the model to assess them against a playbook of acceptable language.

The retrieval stack was textbook: chunk the contract into roughly 800-token windows with a 120-token overlap, embed each chunk, upsert to Pinecone, query at review time with the checklist question as the query vector. They were running it in a single index with one namespace per tenant, which is fine. The chunk metadata stored the contract ID, the section heading, and the raw text.
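
The windowing step above can be sketched roughly as follows. This is an illustration, not the client's chunker: it treats an already-tokenized list as input (a real pipeline would use the embedding model's tokenizer and respect sentence boundaries), and the name window_chunks is hypothetical.

```python
def window_chunks(tokens: list[str], size: int = 800, overlap: int = 120) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # how far the window start advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Toy usage: a 2,000-token document yields three windows.
tokens = [str(i) for i in range(2000)]
chunks = window_chunks(tokens)
print(len(chunks))                           # 3
print(chunks[0][-120:] == chunks[1][:120])   # True: neighbours share 120 tokens
```

Note that the position of a chunk in the returned list is an accident of where the windows fall, which is exactly why it makes a poor identifier.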

The piece that broke them was how they linked a retrieved vector back to a row in their Postgres. Instead of using the vector's ID as the join key, they had used the chunk's position in the document: the integer sequence number assigned during ingestion. That sequence was generated fresh on every embedding run. Pinecone got the position as the vector ID. Postgres got the position as chunk_seq. They joined on it.

For months, this worked.

Why an index rebuild silently breaks positional IDs

On a Friday afternoon nine days before the paralegal noticed, the engineering lead had run a clean rebuild of the Pinecone index. There was a good reason for it: they were upgrading the embedding model from text-embedding-3-small to a larger model with better Dutch-language performance. The rebuild script did exactly what you would expect. It walked every contract in Postgres, re-chunked it, re-embedded each chunk, and upserted to a fresh Pinecone index. Then it swapped the index alias.

The chunker, however, was non-deterministic in a way nobody had documented. It used a sentence splitter that, given the same input, sometimes produced 47 chunks and sometimes 48, depending on whether a particular abbreviation pattern matched. The chunk content was almost identical between runs. The chunk count, and therefore the positional sequence, drifted by one or two for about a quarter of the corpus.

Postgres still held the old chunk_seq values. Pinecone now held new ones. Every join was off by some small, unpredictable integer.

The retriever pulled the top twelve chunks from Pinecone, mapped their IDs back to Postgres rows, and fed the model whatever rows came back. On most contracts, twelve mostly-right chunks were good enough to find the indemnity clause. On the contracts where the indemnity paragraph's chunk had shifted position, the join handed the model a chunk from two paragraphs upstream instead. The model decided indemnity was not present and moved on. No error. No retry. No log line that looked unusual.
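
A toy reproduction of the drift, under simplified assumptions: one contract, a handful of chunks, and plain dicts standing in for Postgres and the vector store. All names and section titles here are made up for illustration.

```python
from typing import Optional

# Rows written by the ORIGINAL ingestion run: chunk_seq -> text.
# That run happened to split the definitions section in two.
postgres_rows = {
    0: "Definitions (a)",
    1: "Definitions (b)",
    2: "Payment terms",
    3: "Indemnity",
    4: "Governing law",
}

# The rebuild re-chunks the same contract, but this time the splitter keeps
# the definitions together, so every later chunk moves up one position.
rebuilt_chunks = ["Definitions", "Payment terms", "Indemnity", "Governing law"]
vector_ids = {text: seq for seq, text in enumerate(rebuilt_chunks)}


def join_to_postgres(vector_id: int) -> Optional[str]:
    """Mimics the broken read path: look up the row by position."""
    return postgres_rows.get(vector_id)


hit = vector_ids["Indemnity"]        # the vector search itself finds the right chunk
print(hit, join_to_postgres(hit))    # 2 Payment terms
```

The similarity search did its job. The wrong text came from the join, which is why nothing in the retrieval metrics moved.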

Warning

If your retriever joins on anything that can change when you rebuild the index (positional offsets, autoincrement IDs, file-order line numbers), you have a silent-failure landmine. The first symptom will not be an exception. It will be a customer noticing missing work.

The nine-day silent drift

What made this nasty is that the failure rate was not constant. Roughly 18% of contracts processed during the nine-day window had at least one wrong chunk returned for the indemnity question. The rest were fine. Their monitoring tracked latency, token usage, error rate, and average confidence score from the model. None of those moved. The model was still confidently writing summaries. They were just summaries of the wrong paragraph.

This is the failure mode that scares us most when we audit a client's RAG stack: the agent is wrong, the agent is sure, and the surface metrics look healthy. It is the mirror image of OWASP's sensitive information disclosure risk (LLM06 in the 2023 list, LLM02 in the 2025 edition): instead of leaking what it should not say, it omits what it must.

The pattern shows up wherever autonomous agents reach into production: a confident system doing work nobody is double-checking, until the bill arrives in some form. For our Utrecht client the bill was a quiet email from a customer whose contract had not been reviewed properly. That email was worth more than nine months of Pinecone hosting.

The fix: content-addressed chunk IDs and metadata-only retrieval

The repair has two parts. First, stop using positional IDs anywhere. Second, stop joining back to Postgres at all on the read path.

The broken pattern, simplified:

def retrieve(query_text: str, k: int = 12) -> list[Chunk]:
    q = embed(query_text)
    res = index.query(vector=q, top_k=k, include_values=False)
    # join back to Postgres on the position-derived sequence
    rows = db.execute(
        "SELECT * FROM contract_chunks WHERE chunk_seq = ANY(:ids)",
        ids=[int(m.id) for m in res.matches],
    )
    return [Chunk.from_row(r) for r in rows]

The fix:

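import hashlib

# Defined elsewhere in the application: embed(), index (the Pinecone index
# handle), section_of(), tenant_namespace(), and the Chunk type.
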
def chunk_id(contract_id: str, text: str) -> str:
    # content-addressed: rebuilds are idempotent
    h = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    return f"{contract_id}:{h}"

def upsert(contract_id: str, chunks: list[str]) -> None:
    vectors = [
        {
            "id": chunk_id(contract_id, c),
            "values": embed(c),
            "metadata": {
                "contract_id": contract_id,
                "text": c,
                "section": section_of(c),
            },
        }
        for c in chunks
    ]
    index.upsert(vectors=vectors, namespace=tenant_namespace())

def retrieve(query_text: str, k: int = 12) -> list[Chunk]:
    q = embed(query_text)
    res = index.query(
        vector=q, top_k=k,
        include_metadata=True,
        namespace=tenant_namespace(),
    )
    return [
        Chunk(
            id=m.id,
            contract_id=m.metadata["contract_id"],
            text=m.metadata["text"],
            section=m.metadata["section"],
            score=m.score,
        )
        for m in res.matches
    ]

Two things change. The vector ID is now derived from the chunk's content (a truncated SHA-256 prefixed with the contract ID), so a rebuild produces the same ID for the same chunk text. And the read path no longer touches Postgres: the text the model sees comes from the vector's metadata, fetched in the same call. If a flaky chunker shifts a boundary, the affected chunk gets a new ID and carries its own text with it, so there is no second table left to fall out of step. This is what Pinecone's own upsert docs have quietly recommended for years: treat metadata as the source of truth for what the LLM reads.

Takeaway

The vector ID is a join key that travels with the embedding. Make it deterministic from content, not from order. If you cannot regenerate the same ID from the same paragraph, your index is not actually idempotent.

Reconciling the missed work

Fixing the live system was step one. Telling the client's customers what we had missed was step two, and it was the part the engineering lead lost sleep over. We wrote a reconciliation job: for every contract reviewed in the nine-day window, re-run the indemnity-clause check against the new retrieval pipeline, diff the outputs, and produce a list of contracts where a clause that should have been flagged was not. That confirmed the 38.
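
The diff step of such a reconciliation job might look roughly like this. It is a sketch under assumptions: each review is reduced to the set of checklist items it flagged, keyed by contract ID, and the function name missed_flags and the toy data are hypothetical.

```python
def missed_flags(
    old: dict[str, set[str]],
    new: dict[str, set[str]],
    item: str = "indemnity",
) -> list[str]:
    """Contract IDs where the re-run flags `item` but the original review did not."""
    return sorted(
        cid
        for cid, flags in new.items()
        if item in flags and item not in old.get(cid, set())
    )


# Toy usage: reviews from the drift window vs. re-runs on the rebuilt pipeline.
original = {"c1": {"governing_law"}, "c2": {"indemnity"}, "c3": set()}
rerun = {"c1": {"governing_law", "indemnity"}, "c2": {"indemnity"}, "c3": {"indemnity"}}
print(missed_flags(original, rerun))  # ['c1', 'c3']
```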

The legaltech's CEO sent a personal email to each affected customer the next morning. Two churned. The rest stayed, and one told her that the candor was the reason. That is not a metric you can build a dashboard around, but it is the one that mattered.

What to audit on your own RAG today

If you run a retrieval-augmented agent in production, three checks take under an hour and would have caught this:

1. Open your upsert code and find what you pass as the vector ID. If it is an autoincrement, a row position, a file-order index, or anything derived from "the order chunks came out of the chunker", you have the same landmine. Replace it with a content-addressed hash.
2. Open your read path and ask whether you join back to a primary database on something other than the vector ID. If yes, why? The metadata field exists for this exact reason.
3. Write a five-minute reproducibility test: ingest the same document twice, dump the IDs of both runs, diff them. If the diff is non-empty, your index is not idempotent and your next rebuild will quietly corrupt the join.
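
The reproducibility test in the last check could be sketched like this. The helper names (id_diff, paragraphs, content_id) are hypothetical, and the paragraph splitter is a stand-in for whatever chunker you actually run; the point is only the shape of the test.

```python
import hashlib
from typing import Callable


def id_diff(
    document: str,
    chunker: Callable[[str], list[str]],
    make_id: Callable[[int, str], str],
) -> set[str]:
    """Ingest `document` twice and return the IDs that appear in only one run."""
    runs = []
    for _ in range(2):
        chunks = chunker(document)
        runs.append({make_id(pos, text) for pos, text in enumerate(chunks)})
    return runs[0] ^ runs[1]  # symmetric difference


def paragraphs(doc: str) -> list[str]:
    """Stand-in chunker: split on blank lines."""
    return [p for p in doc.split("\n\n") if p.strip()]


def content_id(pos: int, text: str) -> str:
    """Content-addressed ID: ignores position entirely."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


doc = "Definitions\n\nIndemnity\n\nGoverning law"
print(id_diff(doc, paragraphs, content_id))  # set() when ingestion is reproducible
```

Run it against your real chunker and ID function over a sample of production documents. A non-empty result from a single document is enough to tell you the next rebuild will not line up with the last one.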

When we rebuilt the retrieval layer for the Utrecht legaltech, the thing we kept coming back to was that the agent had no way to tell us it was wrong. It had no concept of "I should have found an indemnity clause and I did not". We added a second pass that explicitly asserts presence-or-absence for each checklist item against a separate retrieval scoped to a section-keyword filter, and logs a warning when the first and second pass disagree. That is the kind of work we do on the AI agents we ship: not the happy path, the silent-failure path.
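
The disagreement check can be sketched as follows. This is a simplification: a plain keyword match stands in for the second pass (the real one asks the model to assert presence or absence), and the function names, logger name, and keyword list are all hypothetical.

```python
import logging

logger = logging.getLogger("review.crosscheck")


def keyword_present(chunks: list[str], keywords: tuple[str, ...]) -> bool:
    """Second pass stand-in: does any keyword-scoped chunk mention the item?"""
    return any(k.lower() in c.lower() for c in chunks for k in keywords)


def cross_check(
    item: str,
    first_pass_found: bool,
    scoped_chunks: list[str],
    keywords: tuple[str, ...],
) -> bool:
    """Return True when both passes agree; log a warning when they do not."""
    second_pass_found = keyword_present(scoped_chunks, keywords)
    agree = first_pass_found == second_pass_found
    if not agree:
        logger.warning(
            "checklist item %r: first pass=%s, second pass=%s",
            item, first_pass_found, second_pass_found,
        )
    return agree


# Toy usage: the main review missed the clause, the scoped retrieval did not.
scoped = ["11. The Supplier shall indemnify the Customer against all claims."]
print(cross_check("indemnity", False, scoped, ("indemnif", "vrijwar")))  # False
```

The useful signal is the rate of disagreements over time, not any single warning.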

Five-minute audit you can run after closing this tab: grep your retrieval code for the literal strings chunk_seq, position, or idx being passed as a vector ID. If you find one, that is your Friday.

Key takeaway

If your retriever joins on positional chunk IDs, an index rebuild can silently corrupt the join for any document whose chunk boundaries shift. Use content-addressed IDs and read text from metadata.

FAQ

Why didn't standard monitoring catch the dropped clauses?

Latency, error rate, and token usage all looked normal. The model returned confident output every time. The only signal was a domain expert noticing a clause was missing from the review, which took nine days.

Is this a Pinecone-specific bug?

No. The same failure mode happens on any vector store if you use positional or autoincrement IDs and rebuild the index. Weaviate, Qdrant, pgvector, Chroma: all affected if you key your retriever on something that can shift.

How do you make chunk IDs deterministic?

Hash the chunk's text content (SHA-256 is fine) and prefix it with the parent document ID. Re-ingesting the same paragraph then produces the same vector ID, so a rebuild is idempotent and your downstream joins stay valid.

Should the chunk text live in Pinecone metadata or in Postgres?

Storing it in both is fine, but the read path should pull text from vector metadata. Treat Postgres as the system of record for governance and audit, and the vector store's metadata as the source of truth for what the LLM actually sees.

How do you detect this kind of drift before customers do?

Add a presence-or-absence assertion for each high-stakes checklist item, run it against a second retrieval with a different filter, and alert when the two disagree. Disagreement is your early-warning signal, not error rate.
