
Trieve vs Vespa vs Qdrant: picking a service-manual RAG

A 34-person semiconductor-equipment supplier in Eindhoven wanted one honest answer: which retrieval layer survives 140k monthly queries without breaking the ops budget?

Jacob Molkenboer · Founder · A Brand New Company · 14 Jun 2026 · 9 min


It is 23:14 on a Tuesday in a clean-room antechamber on the High Tech Campus. A field engineer is standing in a paper suit with a phone in one hand and a torque wrench in the other. He needs the O-ring spec for a vacuum chamber subassembly, and the 1,412-page service manual on his screen has three revisions stacked on top of each other. The wrong number means a six-figure repair the next morning.

That is the situation the retrieval layer behind a service-manual agent has to survive. We spent six weeks comparing three of them for a 34-person semiconductor-equipment supplier whose engineers were running roughly 140,000 searches a month against 92,000 PDF pages. Here is what we found, scored on the three criteria that actually mattered to the customer: cost per 1,000 queries, freshness latency when a manual changes, and who can answer the procurement auditor in February.

The brief

The corpus was about 6,000 documents, mostly assembly notes, calibration procedures and after-action reports written by service engineers in the field. New revisions land roughly twice a week. Old revisions stay around because some installed equipment is fifteen years old and still under contract.

The customer was clear about what they wanted from the retrieval layer. Predictable cost per 1,000 queries. A defensible answer to "how stale can a freshly uploaded revision be before an engineer gets the wrong torque value". And a credible content provenance trail for the annual procurement audit, which reads more like an aircraft maintenance audit than a software review.

We tested three options end to end with a sample of 4,000 real queries from their support inbox.

Trieve as the managed default

Trieve is the YC-backed managed RAG layer. You upload chunks, you call an HTTP endpoint, you get back results with hybrid search and reranking already wired in. Their pipeline is BM25 plus dense embeddings plus an optional cross-encoder rerank.

Time to first useful answer was the fastest of the three. About a day to wire up ingestion, half a day to swap the existing search calls in the agent. The SDK is honest about what it does and the dashboard shows you what came back for a given query.

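import requests

# API_KEY, DATASET_ID and the chunk fields (chunk_text, doc_id, page, idx,
# revision, revised_at) are supplied by the surrounding ingestion job.
resp = requests.post(
    "https://api.trieve.ai/api/chunk",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "TR-Dataset": DATASET_ID,
    },
    json={
        "chunk_html": chunk_text,
        # Caller-chosen id that encodes document, page and chunk index.
        "tracking_id": f"manual:{doc_id}:p{page}:c{idx}",
        # Tags carry the document type and the revision label.
        "tag_set": ["assembly-note", f"rev:{revision}"],
        "metadata": {
            "doc_id": doc_id,
            "page": page,
            "revised_at": revised_at,
        },
    },
    timeout=30,  # fail instead of hanging the ingestion worker
)
resp.raise_for_status()  # surface 4xx/5xx responses as exceptions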

What we kept hitting was the ceiling on control. The embedding model is theirs. The reranker is theirs. The version numbers are not in our git repository. When the customer asked "what model produced this score on 4 February", the honest answer was "the model Trieve had deployed that day". That is fine for a lot of products. It was not fine for this one.

Vespa Cloud for engine-room control

Vespa is the search engine Yahoo open-sourced after running it in production for more than a decade. Vespa Cloud is the managed offering. It does dense vectors, sparse vectors, lexical search, ranking expressions and tensor math in one engine, and it lets you write the ranking expression yourself.

We set up a schema with three fields: a BM25-indexed body, a 768-dim dense embedding from a domain-tuned BGE model, and a ColBERT-style late-interaction tensor used only in the second ranking phase. The first iteration cost us four engineering days, mostly because phased ranking has a learning curve and the documentation assumes you already think in cheap-then-expensive cascades. The Vespa phased ranking documentation is the canonical reference and worth reading before you write your first schema.
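For readers who do not yet think in cheap-then-expensive cascades, here is a minimal sketch of the idea in plain Python. It is not Vespa schema syntax and not our ranking profile: `phased_rank`, `cheap_score` and `expensive_score` are hypothetical names, standing in for a first-phase score (BM25 plus dense similarity, say) and a costlier second-phase score such as late interaction.

```python
from typing import Callable, Sequence


def phased_rank(
    docs: Sequence[str],
    cheap_score: Callable[[str], float],
    expensive_score: Callable[[str], float],
    rerank_count: int = 100,
    top_k: int = 10,
) -> list[str]:
    """Rank every doc with the cheap scorer, then rescore only the best few."""
    # Phase 1: the cheap score runs over the whole candidate set.
    first_phase = sorted(docs, key=cheap_score, reverse=True)
    # Phase 2: the expensive score runs only on the top `rerank_count` survivors.
    head = sorted(first_phase[:rerank_count], key=expensive_score, reverse=True)
    # Docs that never reached phase 2 keep their phase-1 order behind the head.
    return (head + list(first_phase[rerank_count:]))[:top_k]
```

The point of the cascade is that the expensive scorer is called at most `rerank_count` times per query, however large the corpus is.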

Vespa Cloud bills by resource (vCPU, memory, disk) rather than per query. At 140k queries a month against 92k pages the resource footprint was small enough that a single production node with a hot standby held it comfortably.

The provenance story here is the strongest of the three. Every field, every ranking weight, every model version lives in the Vespa application package (schema and services files) in our git repository. The deployment is a git push. If procurement asks "show me the search configuration on 4 February", we point at a commit hash and re-run.

Qdrant with a ColBERT-v2 reranker

The third option was the self-hosted stack we run for most clients. Qdrant for dense vectors, a small Tantivy sidecar for BM25, and ColBERT-v2 as the reranker. Qdrant's documentation is clear enough that a back-end engineer can have it indexing in an afternoon. ColBERT-v2 from Stanford is heavier to operate: the index is larger than you expect, GPU choice matters, and rerank latency lives or dies on batch size.

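from qdrant_client import QdrantClient, models
from rerank.colbert import colbert_rerank  # project-local reranker wrapper

qdrant = QdrantClient(host="qdrant.internal", port=6333)

# Only chunks flagged as belonging to the current revision are eligible.
CURRENT_ONLY = models.Filter(
    must=[models.FieldCondition(key="is_current", match=models.MatchValue(value=True))]
)

def search(query_text, query_vec, k_dense=200, k_final=10):
    # Stage 1: wide dense recall from Qdrant.
    # Recent qdrant-client releases deprecate search() in favour of query_points().
    dense = qdrant.search(
        collection_name="manuals",
        query_vector=query_vec,
        limit=k_dense,
        query_filter=CURRENT_ONLY,
        with_payload=True,
    )
    candidates = [(h.id, h.payload["text"]) for h in dense]
    # Stage 2: ColBERT-v2 rescoring, then keep the top k_final.
    return colbert_rerank(query_text, candidates)[:k_final]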
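The function above draws its candidates from the dense index only. One common way to fold BM25 hits from a lexical sidecar into the candidate list before reranking is reciprocal rank fusion. The sketch below is an illustration of that technique under stated assumptions, not necessarily how this stack merges the two lists; `rrf_merge` and the constant `k=60` are illustrative choices.

```python
from collections import defaultdict


def rrf_merge(rankings, k=60, limit=200):
    """Fuse several ranked lists of chunk ids with reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); early ranks count most.
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so output is stable.
    fused = sorted(scores, key=lambda c: (-scores[c], str(c)))
    return fused[:limit]
```

A chunk that appears in both the dense and the BM25 list outranks one that appears in only a single list at a similar position, which is the behaviour you want for queries that mix prose with rare part numbers.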

We sized it as a Hetzner CCX33 for Qdrant (16 vCPU, 64 GB, NVMe) and a separate AX52 with a single NVIDIA L4 for the reranker, both with hot replicas. Hardware bill landed under 300 euros a month.

The hidden cost was us. The chunking pipeline, the ingestion queue, the staleness check, the rerank cascade and the observability all had to be written. Roughly six engineering days for the first version, and about one ops day per month after that.

Cost per thousand queries

At 140,000 queries a month, here is the shape of the cost. Numbers are rounded estimates from our test deployments and the public pricing pages; your invoice will differ.

Trieve, on a mid-tier plan with the rerank toggle on, came to roughly 4 to 7 euros per 1,000 queries. The per-query number is small; what pushes the total up is the per-chunk storage when you keep two or three revisions of every page.

Vespa Cloud, on the smallest production configuration that held the corpus with a hot standby, came in around 2 to 4 euros per 1,000 queries. Because resource cost dominates over query count at this scale, the price is roughly flat against volume until you have to add nodes.

Self-hosted Qdrant + ColBERT-v2 was the cheapest on hardware, well under 0.50 euro per 1,000 queries. Once you amortise the build cost over twelve months, the all-in number lands at roughly 1.50 to 2 euros per 1,000 queries in year one and clearly below one euro in year two.

The ranking shifts with your time horizon. If you need to ship in three weeks and then forget about it, Trieve wins. If you have a back-end team and a five-year horizon, the self-hosted stack is the cheapest by a wide margin.
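If you want to run the same comparison on your own invoices, the arithmetic is short. This is a minimal sketch; the function names are ours for illustration, and the hardware, ops and build figures passed to `all_in_per_1000` are placeholders, not numbers from this project.

```python
def cost_per_1000(monthly_cost_eur: float, monthly_queries: int) -> float:
    """Convert a monthly bill into euros per 1,000 queries."""
    return monthly_cost_eur / (monthly_queries / 1000)


def monthly_bill(rate_per_1000_eur: float, monthly_queries: int) -> float:
    """Convert a per-1,000-queries rate back into a monthly bill."""
    return rate_per_1000_eur * monthly_queries / 1000


def all_in_per_1000(hardware_monthly_eur, ops_monthly_eur, build_cost_eur,
                    monthly_queries, amortise_months=12):
    """Per-1,000 cost with the one-off build cost spread over N months."""
    monthly_total = (hardware_monthly_eur + ops_monthly_eur
                     + build_cost_eur / amortise_months)
    return cost_per_1000(monthly_total, monthly_queries)


# A rate of 4 to 7 euros per 1,000 queries at 140,000 queries a month:
print(monthly_bill(4, 140_000), monthly_bill(7, 140_000))  # 560.0 980.0
# Placeholder inputs only: swap in your own hardware, ops and build figures.
print(round(all_in_per_1000(60, 20, 1200, 140_000), 2))
```

Setting `amortise_months` to 24 or 60 shows how quickly the build cost fades from the per-query number as the time horizon grows.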

Freshness when an assembly note changes

A field engineer uploads a revised assembly note at 14:32. When does the agent stop returning the old one?

Trieve: roughly 90 seconds to 4 minutes from upload to first searchable result in our tests, depending on chunk count. Their ingest is asynchronous and the backlog was tight on the days we measured.

Vespa Cloud: under 10 seconds in practice. Vespa indexes in real time and ranks the new document on the next query. This is where its "search engine, not vector database" heritage shows up.

Qdrant: under 5 seconds for the dense vector to be queryable. The ColBERT index merge can lag a few minutes if compaction is left on defaults. We wrote a small "is the newest revision already indexed" probe that the agent calls before answering, because the audit clause says we cannot return a stale revision when a fresh one exists.
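A probe of this kind can be sketched roughly as follows. It assumes a revision registry that is updated at upload time and maps each document id to its newest revision number; `Hit`, `split_fresh_and_stale` and the `latest_revision` dict are hypothetical names for illustration, not the code we run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    doc_id: str
    revision: int
    text: str


def split_fresh_and_stale(hits, latest_revision):
    """Separate hits on the newest known revision from superseded ones.

    `latest_revision` maps doc_id -> newest revision number and is assumed
    to be written when a revision is uploaded, before indexing finishes.
    """
    fresh, stale = [], []
    for hit in hits:
        # Documents missing from the registry are treated as current.
        newest = latest_revision.get(hit.doc_id, hit.revision)
        (fresh if hit.revision >= newest else stale).append(hit)
    return fresh, stale
```

If `stale` is non-empty, the agent can drop those hits, or hold the answer until the newest revision shows up in the index.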

Warning

The freshness number that matters is not "how fast the new chunk becomes searchable". It is "how fast the old chunk stops being returned". Most teams measure the first and ship the second as a bug.
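Measuring the second number takes only a small polling loop. The sketch below assumes a `search` callable that returns the chunk ids for a query; that callable, the function name and the default timings are illustrative assumptions.

```python
import time


def seconds_until_retired(search, query, old_chunk_id, timeout_s=600.0,
                          poll_s=1.0, clock=time.monotonic, sleep=time.sleep):
    """Poll `search` until `old_chunk_id` stops being returned.

    Returns the elapsed seconds at the first poll where the old chunk is
    absent, or None if it is still being returned after `timeout_s`.
    """
    start = clock()
    while True:
        elapsed = clock() - start
        if old_chunk_id not in search(query):
            return elapsed
        if elapsed >= timeout_s:
            return None
        sleep(poll_s)
```

Start the timer at the moment of upload and run it with a query that is known to hit the superseded chunk. The `clock` and `sleep` parameters exist so the loop can be tested without waiting.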

Provenance and the February audit

Semiconductor procurement audits want a content provenance trail. For each answer the agent gave on a given day, the auditor wants to see four things: which document revision was returned, which embedding model version produced the vector, which reranker checkpoint scored the candidates, and which ranking parameters were active.

Trieve exposes some of this through their dashboard, but the embedding model and reranker versions belong to the vendor. We could not point at a git commit and say "this was the configuration that day". For a customer whose own customers are TSMC and ASML, that gap was disqualifying.

Vespa Cloud was the easiest to audit. The full ranking specification lives in version control. Embeddings come from a model file shipped with the deployment. The auditor takes the commit hash, you reconstruct the deployment and re-run the query, and the ranking matches.

Qdrant + ColBERT-v2 can match Vespa's provenance story, but you have to build it. You need a stable manifest per query that records the corpus snapshot id, the embedding model hash, the reranker checkpoint hash and the ranking parameters. We ended up writing a small "audit packet" that the search wrapper emits to a logging bucket on every query, indexed by query id and retained for seven years.
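A manifest along these lines could look like the sketch below. The field names, `AuditPacket`, `sha256_of_bytes` and `to_log_line` are illustrative assumptions rather than our production schema; the only parts taken from the text above are the four things the manifest has to record and the query id it is indexed by.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AuditPacket:
    query_id: str
    query_text: str
    corpus_snapshot_id: str
    embedding_model_sha256: str
    reranker_checkpoint_sha256: str
    ranking_params: dict
    returned_revisions: list  # e.g. [{"doc_id": ..., "revision": ...}]


def sha256_of_bytes(data: bytes) -> str:
    """Hash a model file or checkpoint so the packet pins an exact artefact."""
    return hashlib.sha256(data).hexdigest()


def to_log_line(packet: AuditPacket) -> str:
    """Serialise deterministically so identical packets give identical lines."""
    return json.dumps(asdict(packet), sort_keys=True, separators=(",", ":"))
```

One JSON line per query keeps the log append-only and easy to ship to whatever bucket holds the retention policy.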

What we shipped for the Eindhoven team

We shipped the self-hosted Qdrant + ColBERT-v2 stack with a Vespa-style provenance log around it. The deciding factor was not cost. It was the audit. The customer wanted a configuration they owned end to end, in their own git repository, with the option to swap the embedding model without renegotiating a vendor contract.

If we had been building for a 200-person company that wanted to outsource the whole search problem and never think about it again, we would have shipped Vespa Cloud. If we had been building a six-week prototype to prove the agent was useful before deciding to invest in retrieval, Trieve.

The hardest part of building this service-manual agent turned out to be the audit packet itself. We solved it by writing the manifest at query time rather than at answer time, so the trail survives even when the answer-generation step fails. That pattern came out of a few late nights and is now standard in the AI agents we ship.
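The ordering is the whole trick, and it fits in a few lines. This is a sketch under assumptions: `retrieve`, `emit_manifest` and `generate` are hypothetical callables standing in for the search wrapper, the log writer and the answer-generation step.

```python
def answer_with_trail(query_id, query_text, retrieve, emit_manifest, generate):
    """Log the retrieval manifest before the answer is generated."""
    # Retrieval returns the hits plus a manifest describing how they were found.
    hits, manifest = retrieve(query_text)
    # Emit first: if generation raises below, the trail is already written.
    emit_manifest({"query_id": query_id, **manifest})
    return generate(query_text, hits)
```

If `generate` raises or times out, the exception still propagates to the caller, but the manifest for that query id is already in the log.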

A five-minute audit you can run today

If you have a RAG system in production right now, pull one query from yesterday's logs. Ask yourself: which document revision, which embedding model version, and which reranker checkpoint produced that answer? If you cannot answer all three from a git commit or a logged manifest, you do not yet have a provenance trail. That is the gap to close before the next audit lands on your desk.
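If your logs are structured, the check can be scripted. The field names below are assumptions for illustration; map them to whatever your own log records call these three things.

```python
REQUIRED_FIELDS = (
    "document_revision",
    "embedding_model_version",
    "reranker_checkpoint",
)


def missing_provenance(record: dict) -> list[str]:
    """Return the provenance fields that are absent or empty in a log record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]
```

An empty list means yesterday's query passes; anything else names the gap.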

Key takeaway

Pick your RAG retrieval layer on audit and freshness, not on per-query price. The cheapest stack is the one whose configuration lives in your git repo.

FAQ

Which retrieval layer is cheapest at 140k queries a month?

On hardware alone, self-hosted Qdrant + ColBERT-v2 was well under 0.50 euro per 1,000 queries. With engineering time amortised over twelve months, the all-in figure was roughly 1.50 to 2 euros in year one and below one euro from year two onward.

How fresh is Vespa Cloud when a new manual revision arrives?

Under 10 seconds from upload to searchable in our tests. Vespa indexes in real time and ranks the new document on the next query without a separate refresh step.

Can Trieve answer a procurement provenance audit?

Partially. Their dashboard exposes chunk-level history, but the embedding and reranker model versions live on Trieve's side and are not pinned to a git commit you own.

Why add ColBERT-v2 on top of Qdrant?

Dense vectors alone miss spec-heavy queries with rare part numbers. ColBERT-v2 rescoring the top 100 to 200 candidates pulls the right chunk to the top without retraining anything.

