RAG outage: nine hours of stale Qdrant on a live agent

On a Tuesday morning a customer-service RAG agent for a Dutch insurer started quoting policy clauses that had been retired three weeks earlier. Nine hours later, we knew why.

Jacob Molkenboer · Founder · A Brand New Company · 10 Nov 2024 · 9 min

On a Tuesday morning in February, a customer-service RAG agent we run for a Dutch insurance client started quoting policy clauses that had been retired three weeks earlier. The agent was confident. The clauses were dead. For the next nine hours, every caller who asked about cancellation windows got an answer that would have cost the insurer a small claims case.

The architecture was the one most teams ship in week three. Postgres as the source of truth for policies, FAQ entries, and decision tables. A self-hosted Qdrant cluster as the vector index over chunks of those documents. An ingestion job that ran every night at 02:00 to re-embed anything that had changed. On most days that nightly gap was invisible. On Tuesday it was not.

The fault we shipped

The trigger event was a routine policy edit. Ops retired three clauses from the cancellation flow on Monday afternoon. They hit save in the CMS, the change landed in Postgres, and the updated policy went live for human agents instantly. The RAG side did not move.

That alone was a problem we had documented and accepted. A worst-case 24-hour staleness window between a write in Postgres and a corresponding update in Qdrant. Operations had been told. The risk was logged in the runbook.

The actual failure was deeper. When Monday's clauses were deleted, the corresponding Qdrant points stayed in the collection. The nightly job only re-embedded rows that still existed in Postgres. It never asked which points in Qdrant had no matching Postgres row. There was no tombstone path.

So Tuesday morning, the agent retrieved chunks whose payloads still contained the old policy text. The retriever joined back to Postgres on entity id, found nothing, and the application code treated 'no join' as 'no extra metadata, ship the chunk.' The chunk's payload field went straight into the context window. The model answered from the dead text with full confidence.

It took nine hours, two customer complaints, and a manual audit by a compliance officer to spot the pattern. The fix that day was crude. We truncated the Qdrant collection and re-ran the full embedding job from Postgres. By 18:00 the agent was clean. By 19:00 we were writing the post-mortem.

The post-mortem had one line that mattered. There was no fence. We had a job, not a guarantee.

Why dual writes are a euphemism for lying

The naive fix you reach for after this incident is dual write. When application code updates Postgres, also write to Qdrant in the same handler.

def update_policy(policy_id, new_body):
    db.execute(
        "UPDATE policies SET body = %s WHERE id = %s",
        (new_body, policy_id),
    )
    chunks = chunk_and_embed(new_body)
    qdrant.upsert(collection_name="policies", points=chunks)

This looks atomic. It is not. The two calls reach two different systems over two different network paths. Either can fail independently. The process can crash between them. There is no transaction that spans both. The probability of inconsistency is not zero, and the exposure accumulates with every write.

What we had was worse, because we had a nightly batch instead of a dual write. But dual write would not have saved us. Dual write trades a 24-hour staleness window for a smaller, sneakier inconsistency window. The bug class is identical. Two independent state machines that the application pretends are one.
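The crash window is easy to reproduce with two in-memory stand-ins. This is a toy sketch, not the insurer's code: the dicts play the roles of Postgres and Qdrant, and `CrashBetweenWrites`, `dual_write`, and the `fail_before_index` flag are hypothetical names used only to force the failure.

```python
class CrashBetweenWrites(Exception):
    """Stands in for a process crash or network failure after the first write."""


def dual_write(source, index, policy_id, new_body, fail_before_index=False):
    # First write: the source of truth (a dict standing in for Postgres).
    source[policy_id] = new_body
    if fail_before_index:
        # Nothing rolls back the write above: no transaction spans both stores.
        raise CrashBetweenWrites(policy_id)
    # Second write: the vector index (a dict standing in for Qdrant).
    index[policy_id] = new_body


source, index = {}, {}
dual_write(source, index, "p1", "cancel within 14 days")

try:
    dual_write(source, index, "p1", "cancel within 30 days", fail_before_index=True)
except CrashBetweenWrites:
    pass

# Ids whose indexed text no longer matches the source of truth.
diverged = {pid for pid in source if index.get(pid) != source[pid]}
print(diverged)  # {'p1'}
```

After the failed call the source says 30 days and the index still says 14, and neither system knows it.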

The pattern that does work is the transactional outbox. It is older than RAG, older than vector stores. Microsoft documents it as the answer to 'reliably publish a message when you commit a database transaction.' Substitute 'index in Qdrant' for 'publish a message' and the shape is the same.

The idea is plain. Every state change writes two rows in the same Postgres transaction. One row is the actual data update. The other is an event in an outbox table. A worker process reads the outbox, projects each event into the downstream system, and marks it processed. If the worker crashes mid-batch, it replays. If the downstream rejects a write, it retries. If the source row gets deleted, the outbox carries a delete event and the worker deletes the matching points. Postgres is the only place where writes are negotiated. Everything else is a projection.

The outbox fence

For the insurer, we landed on one extra table, one version column on every indexed source table, and one worker. The outbox itself is unremarkable.

CREATE TABLE rag_outbox (
  id            BIGSERIAL PRIMARY KEY,
  occurred_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
  processed_at  TIMESTAMPTZ,
  entity_kind   TEXT NOT NULL,
  entity_id     UUID NOT NULL,
  op            TEXT NOT NULL CHECK (op IN ('upsert','delete')),
  source_version BIGINT NOT NULL,
  payload       JSONB
);

CREATE INDEX rag_outbox_unprocessed
  ON rag_outbox (id)
  WHERE processed_at IS NULL;

The partial index keeps the working set small even as the table grows into the millions of rows. We never truncate it. The audit trail is too useful when a customer complaint lands six weeks later asking why the agent said what it said.

The application write becomes one transaction.

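import json

def update_policy(policy_id, new_body):
    # One transaction: the source update and the outbox event commit together.
    with db.transaction():
        row = db.fetch_one(
            "UPDATE policies SET body = %s, version = version + 1 "
            "WHERE id = %s RETURNING version",
            (new_body, policy_id),
        )
        if row is None:
            # No such policy: emit no event for a write that did not happen.
            raise LookupError(f"policy {policy_id} not found")
        db.execute(
            "INSERT INTO rag_outbox "
            "(entity_kind, entity_id, op, source_version, payload) "
            "VALUES (%s, %s, %s, %s, %s)",
            ("policy", policy_id, "upsert", row["version"],
             json.dumps({"body": new_body})),
        )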

Now the Postgres update and the outbox event commit together or not at all. They cannot diverge. The version column on the source table increments on every change and gets copied into the event. That version is the fence.
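The delete path follows the same shape, and it is the one our original job lacked. Below is a minimal sketch that uses Python's built-in sqlite3 as a stand-in for Postgres so it runs without a server. The `delete_policy` helper and the `version + 1` convention for the tombstone are assumptions for illustration, not code from the insurer's system.

```python
import sqlite3

# sqlite3 stands in for Postgres here; the tables are a simplified
# version of the schema shown earlier.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE policies (id TEXT PRIMARY KEY, body TEXT, version INTEGER NOT NULL);
    CREATE TABLE rag_outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        entity_kind TEXT NOT NULL,
        entity_id TEXT NOT NULL,
        op TEXT NOT NULL CHECK (op IN ('upsert', 'delete')),
        source_version INTEGER NOT NULL,
        payload TEXT
    );
""")


def delete_policy(conn, policy_id):
    """Hypothetical helper: remove the row and record a tombstone atomically."""
    with conn:  # commits on success, rolls back if anything below raises
        row = conn.execute(
            "SELECT version FROM policies WHERE id = ?", (policy_id,)
        ).fetchone()
        if row is None:
            return False  # nothing to delete, so no event to emit
        conn.execute("DELETE FROM policies WHERE id = ?", (policy_id,))
        conn.execute(
            "INSERT INTO rag_outbox "
            "(entity_kind, entity_id, op, source_version, payload) "
            "VALUES (?, ?, ?, ?, ?)",
            # version + 1 is an assumption: it orders the tombstone after
            # the last upsert event for the same entity.
            ("policy", policy_id, "delete", row[0] + 1, None),
        )
    return True


db.execute("INSERT INTO policies VALUES ('p1', 'old clause', 3)")
db.commit()
deleted = delete_policy(db, "p1")
events = db.execute(
    "SELECT entity_id, op, source_version FROM rag_outbox"
).fetchall()
print(deleted, events)  # True [('p1', 'delete', 4)]
```

The row and its tombstone disappear and appear together, so the worker always has a delete event to project for every row that is gone.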

The worker drains the outbox in batches, with the standard FOR UPDATE SKIP LOCKED pattern so multiple workers can run in parallel without stepping on each other.

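from uuid import uuid4

from qdrant_client.models import (
    FieldCondition, Filter, MatchValue, PointStruct,
)

def drain_outbox(batch_size=100):
    # FOR UPDATE SKIP LOCKED only holds its row locks while the transaction
    # is open, so the whole batch runs inside one.
    with db.transaction():
        rows = db.fetch_all(
            "SELECT * FROM rag_outbox "
            "WHERE processed_at IS NULL "
            "ORDER BY id LIMIT %s FOR UPDATE SKIP LOCKED",
            (batch_size,),
        )
        for row in rows:
            by_entity = Filter(must=[
                FieldCondition(
                    key="entity_id",
                    match=MatchValue(value=str(row["entity_id"])),
                )
            ])
            # Remove every existing chunk for this entity first. For an
            # upsert this clears the previous version's points, which
            # random point ids would otherwise leave behind.
            qdrant.delete(
                collection_name="policies",
                points_selector=by_entity,
            )
            if row["op"] == "upsert":
                chunks = chunk_and_embed(row["payload"]["body"])
                qdrant.upsert(
                    collection_name="policies",
                    points=[
                        PointStruct(
                            id=str(uuid4()),
                            vector=ch.vector,
                            payload={
                                "entity_id": str(row["entity_id"]),
                                "source_version": row["source_version"],
                                "text": ch.text,
                            },
                        ) for ch in chunks
                    ],
                )
            db.execute(
                "UPDATE rag_outbox SET processed_at = now() WHERE id = %s",
                (row["id"],),
            )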

The entity_id filter on the Qdrant delete matters. A single source row can produce many chunks, each its own point. Deleting by source id rather than by point id means you never leak chunks when a row is removed. The Qdrant points API supports filter-based deletes natively, which is the small reason we kept Qdrant over rolling our own.
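Replays raise a related question about point ids. With random ids, an event that is processed twice after a worker crash writes a second copy of every chunk. One possible alternative, sketched here as an assumption rather than what we run, is to derive each point id from the chunk's identity with `uuid5`, which Qdrant accepts like any other UUID. The namespace string and the `point_id` helper are hypothetical.

```python
from uuid import NAMESPACE_URL, UUID, uuid5

# Hypothetical namespace; any fixed UUID works as long as it never changes.
POINT_NAMESPACE = uuid5(NAMESPACE_URL, "rag-outbox/policies")


def point_id(entity_id: str, source_version: int, chunk_index: int) -> str:
    """Derive a stable point id from the chunk's identity instead of uuid4()."""
    name = f"{entity_id}:{source_version}:{chunk_index}"
    return str(uuid5(POINT_NAMESPACE, name))


entity = "0b9f6c1e-2d3a-4e5f-8a7b-1c2d3e4f5a6b"
first = point_id(entity, 7, 0)
replayed = point_id(entity, 7, 0)
next_chunk = point_id(entity, 7, 1)
print(first == replayed, first == next_chunk)  # True False
```

Replaying the same event then overwrites the same points instead of adding new ones. It does not remove points from an earlier version of the same entity, so the filter-based delete is still needed for that.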

Latency at the write path went up by a few milliseconds for the extra insert. We never noticed it in production. The agent's read path is where users feel time, and that path was untouched.

Retrieval-time version checks

The outbox is necessary but not sufficient. It guarantees that every Postgres write produces a downstream effect, eventually. It says nothing about the gap between 'eventually' and 'now.' For a customer-facing agent answering policy questions in real time, that gap matters.

So the second fence runs at retrieval. Every point in Qdrant carries the source_version it was indexed from. The retriever fetches its top-k from Qdrant, then joins back to Postgres on entity_id and compares versions. If Postgres has a newer version than the chunk claims, the chunk is stale and gets dropped. If Postgres has no row at all, the chunk is a ghost and gets dropped.

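def retrieve(query, k=8):
    # Over-fetch so that dropping stale or ghost chunks can still leave k results.
    hits = qdrant.search(
        collection_name="policies",
        query_vector=embed(query),
        limit=k * 2,
    )
    entity_ids = list({h.payload["entity_id"] for h in hits})
    rows = db.fetch_all(
        # Payload ids are strings; the cast assumes policies.id is a UUID column.
        "SELECT id, version FROM policies WHERE id = ANY(%s::uuid[])",
        (entity_ids,),
    )
    current = {str(r["id"]): r["version"] for r in rows}
    # A missing row (ghost) yields None and a newer row (stale) yields a
    # different version; either way the chunk is dropped.
    fresh = [
        h for h in hits
        if current.get(h.payload["entity_id"]) == h.payload["source_version"]
    ]
    return fresh[:k]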

On the insurer's hardware this adds one Postgres round-trip per query, about 4 ms for the 16 candidates that k=8 over-fetches. The agent's median end-to-end latency is dominated by the model call, not by retrieval. We measured before and after. The p50 did not budge.

What changed was the guarantee. Retrieval used to be 'trust the nightly job.' Now it is 'verified at every call.' A chunk in Qdrant cannot lie to the prompt without first lying to Postgres, and Postgres does not lie.

Reconciliation as a third fence

Outbox covers writes. Version check covers reads. The third fence covers the cases neither one anticipates. A buggy chunker that silently drops content. A manual repair that bypasses the application. A Qdrant collection that gets restored from a stale snapshot during a hardware swap.

The reconciler is a 50-line script that runs hourly. It picks 1000 random Postgres rows, fetches the corresponding Qdrant points by entity_id filter, and asserts three things. Every Postgres row has at least one point. Every point has a matching Postgres row. Every point's source_version equals the Postgres version. Any mismatch goes to a Slack channel with the entity id and which assertion failed.
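The three assertions can be sketched as a pure comparison, with the Postgres and Qdrant reads left out. One assumption to flag: a sample drawn from Postgres can only surface missing points and version drift, so the ghost check here assumes `points` also includes a separate sample read from the Qdrant side. The `Mismatch` type, the `reconcile` function, and the reason strings are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mismatch:
    entity_id: str
    reason: str  # "missing_points", "ghost_point", or "version_drift"


def reconcile(pg_versions, pg_sampled_ids, points):
    """Compare sampled source rows against indexed points.

    pg_versions: {entity_id: version} for every id below that exists in Postgres.
    pg_sampled_ids: entity ids drawn at random from Postgres.
    points: (entity_id, source_version) pairs read from Qdrant payloads.
    """
    mismatches = []
    indexed = {entity_id for entity_id, _ in points}
    # Assertion 1: every sampled Postgres row has at least one point.
    for entity_id in pg_sampled_ids:
        if entity_id not in indexed:
            mismatches.append(Mismatch(entity_id, "missing_points"))
    for entity_id, source_version in points:
        if entity_id not in pg_versions:
            # Assertion 2: every point has a matching Postgres row.
            mismatches.append(Mismatch(entity_id, "ghost_point"))
        elif pg_versions[entity_id] != source_version:
            # Assertion 3: the point's version equals the row's version.
            mismatches.append(Mismatch(entity_id, "version_drift"))
    return mismatches


found = reconcile(
    pg_versions={"a": 2, "b": 1, "c": 5},
    pg_sampled_ids=["a", "b", "c"],
    points=[("a", 2), ("b", 0), ("d", 7)],
)
for m in found:
    print(m.entity_id, m.reason)
```

Each `Mismatch` carries what the Slack message needs: the entity id and which assertion failed.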

In the first week of running the reconciler, we caught two bugs we had not been looking for. Both were in our chunking step, where a particular markdown table format was producing zero output. Neither would have shown up in tests, because tests assume the input. Reconciliation does not. It looks at what is there and asks whether it should be.

Warning

If the event that drives your vector store write is not committed in the same database transaction as your source-of-truth write, you do not have eventual consistency. You have eventual divergence, and the agent will not know.

What changed in our default template

When we rebuilt the insurer's RAG agent on the new fence, the surprise was how much simpler the prompt got. Once retrieval was trustworthy, we stopped writing defensive language like 'if the policy was recently changed, prefer the most recent version' into the system message. The fence did the work the prompt was pretending to do. The agent's tone improved on a benchmark we run internally, but the real win was operational. Nobody got paged at 02:00 again, and the compliance officer stopped reading transcripts on Sundays.

The smallest thing you can do today: open your RAG repo, find the place where you write to your vector store, and check whether that write is driven by an event committed in the same database transaction as the source-of-truth write. If it is not, you are one crash away from the same bug we shipped. The fix is one outbox table, a version column, and a worker.

Key takeaway

If the event that drives your vector store write is not committed in the same transaction as your source-of-truth write, you are one crash away from serving stale answers with confidence.

FAQ

Why not just use pgvector and skip Qdrant entirely?

Single-store setups dodge this class of bug by construction. We kept Qdrant for filter performance on a 40M-point collection, but pgvector is the right default for anything under a few million chunks.

Does the retrieval-time version check work with hybrid search?

Yes. The version comparison runs on the merged result set after BM25 and vector scores are combined. The check is independent of how the candidates were retrieved.

How often should the reconciliation worker run?

Hourly is enough for most agents. If you index high-stakes data like medical or legal content, every 10 minutes with a smaller sample size catches drift faster without hammering either store.

What if the outbox worker falls badly behind?

Alert on outbox lag, not just on errors. We page when unprocessed events older than 60 seconds exceed 500 rows. The version check at retrieval still keeps stale chunks out of the prompt while the worker catches up.
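As a rough sketch of that rule: the thresholds come from the answer above, while the query text and the `should_page` helper are assumptions about how a monitor might be wired, not our production alerting.

```python
from datetime import datetime, timedelta, timezone

# A query a monitor might run against the outbox; column names follow the
# rag_outbox schema, the query itself is an assumption.
LAG_QUERY = """
    SELECT count(*) FROM rag_outbox
    WHERE processed_at IS NULL
      AND occurred_at < now() - interval '60 seconds'
"""


def should_page(unprocessed_occurred_at, now,
                max_age=timedelta(seconds=60), max_rows=500):
    """Return True when more than max_rows unprocessed events are older than max_age."""
    late = sum(1 for ts in unprocessed_occurred_at if now - ts > max_age)
    return late > max_rows


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
backlog = [now - timedelta(minutes=2)] * 501
print(should_page(backlog, now))  # True
```

Counting only events older than the age threshold keeps a healthy burst of fresh writes from paging anyone.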
