RAG stacks: hosted, Qdrant, or pgvector for Dutch B2B SaaS

It is 17:04 on a Thursday. Sales just told you the demo agent quoted the wrong SLA. Who opens the chunker, and which stack did you bet the answer on?

Jacob Molkenboer · Founder · A Brand New Company · 11 Apr 2025 · 8 min

It is 17:04 on a Thursday in a Utrecht office. The product lead at a €9M ARR Dutch B2B SaaS has just been pinged by sales: the customer-facing RAG agent quoted the wrong SLA window to a prospect on a live call. The CTO opens Slack, then realises she does not know which file in which repository owns the chunker. The agent has 312 PDFs, 1,400 Confluence pages, and a Notion export behind it. The question on the table is not which vector database is fastest. The question is who fixes this in the next twenty minutes, and what the bill looks like at the end of the month.

This is the decision we walk every sub-€18M Dutch B2B SaaS through when they ask us to ship a customer-facing knowledge agent. There are three serious options in 2026 for a team this size. Hosted file-search at the model vendor. A self-hosted Qdrant stack with a reranker. A Postgres pgvector setup with BM25 hybrid search inside the database you already run. None of them is wrong. All of them are wrong for someone.

The two axes that actually decide it

Most architecture posts score RAG stacks on recall@10, latency, and "developer experience". Those matter, but they are not what kills a project in year two. The two axes we score on are blunt.

First: per-tenant cost at 50,000 pages. Not per-request cost. Not per-token cost in isolation. The all-in monthly cost to keep one customer's corpus indexed, embedded, queried at their actual call volume, and re-embedded when you change models. Fifty thousand pages is the size where every option's pricing curve stops being linear. Below that, anything works. Above that, you discover which contract you signed.

Second: who fixes the chunker at 17:00 on a Thursday. When the agent gives a bad answer, somebody has to open the indexing pipeline, find the document, look at how it was split, adjust the split, re-embed the affected pages, and verify. That person needs to exist, needs to be reachable, and needs read access to the code that does the splitting. If the splitting happens behind a vendor API you cannot see, the answer to "why did it quote the wrong SLA" is "we filed a ticket".
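The repair loop described above (find the document, look at the split, adjust it, re-index only what changed, verify) is easier to run when every chunk records where it came from. A minimal sketch, assuming an in-memory store; `Chunk`, `index_document`, and `find_chunks` are hypothetical names for illustration, and the embedding step is left out on purpose.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    page: int      # 1-based page number in the source document
    position: int  # order of the chunk within its page
    text: str


Splitter = Callable[[str], list[str]]
Store = dict[str, list[Chunk]]


def index_document(store: Store, doc_id: str, pages: list[str], split: Splitter) -> None:
    """Split every page and replace all chunks stored for this one document."""
    store[doc_id] = [
        Chunk(doc_id, page_no, position, text)
        for page_no, page in enumerate(pages, start=1)
        for position, text in enumerate(split(page))
    ]


def find_chunks(store: Store, phrase: str) -> list[Chunk]:
    """Return every chunk whose text contains the phrase, ignoring case."""
    needle = phrase.lower()
    return [
        chunk
        for chunks in store.values()
        for chunk in chunks
        if needle in chunk.text.lower()
    ]
```

Calling `find_chunks(store, "4 hours")` shows which document, page, and position produced the quoted text. Calling `index_document` again with a corrected splitter replaces only that document's chunks and leaves every other document untouched. In a real pipeline the same `doc_id` would also scope the re-embedding.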

Takeaway

RAG architecture is a staffing decision in disguise. Pick the stack whose failure mode matches the person you can call.

Option A: hosted file-search

Anthropic, OpenAI, and a handful of others now offer file-search as a managed feature of their assistant or agent APIs. You upload documents, the vendor handles chunking, embedding, retrieval, and reranking, and you get citations back in the response. For a four-person team without a dedicated infra hire, this is a real answer.

The math at 50,000 pages is roughly this. Storage and indexing fees per tenant are modest, usually a single-digit-euro line item. The real cost sits in inference: every turn pulls retrieved context into the prompt, and that context is billed at input-token rates. At a customer with 80 agent conversations a day and an average of 4,000 tokens of retrieved context per turn, you are looking at a few hundred euros per tenant per month on a mid-tier model. Not catastrophic. Predictable.
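The arithmetic behind that estimate can be sketched in a few lines. The 80 conversations a day and 4,000 context tokens per turn come from the paragraph above. The turns per conversation and the price per million input tokens are assumptions for illustration, not vendor quotes, and the function ignores output tokens, system prompts, and storage fees.

```python
def monthly_context_cost_eur(
    conversations_per_day: int,
    turns_per_conversation: int,
    context_tokens_per_turn: int,
    eur_per_million_input_tokens: float,
    days_per_month: int = 30,
) -> float:
    """Monthly cost of the retrieved context alone, billed as input tokens."""
    tokens = (
        conversations_per_day
        * turns_per_conversation
        * context_tokens_per_turn
        * days_per_month
    )
    return tokens * eur_per_million_input_tokens / 1_000_000


# Assumed for illustration: 8 turns per conversation, EUR 3 per million input tokens.
estimate = monthly_context_cost_eur(80, 8, 4_000, 3.0)
print(f"about EUR {estimate:.0f} per tenant per month")
```

With those assumed inputs the result lands in the low hundreds of euros. Doubling the turns per conversation or moving up a model tier moves it proportionally, which is why the tenant's traffic shape matters more than the storage line.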

The pain is not the bill. The pain is the chunker. You cannot see it. When sales reports the bad SLA answer, your CTO opens the document, sees the SLA table is rendered as a two-column PDF with a footnote, and realises the vendor's chunker probably split the footnote away from the row. Her options are: rewrite the source PDF, or change the prompt to compensate. She cannot change the chunking strategy. The fix lands in the source document, not the code.

There is also a wrinkle around data posture. Hosted file-search means your indexed content lives on vendor infrastructure, frequently outside the EU. The EDPB's post-Schrems II recommendations still drive Dutch procurement reviews for public-sector, healthcare, and finance buyers. If those are your customers, hosted file-search becomes a procurement conversation, not a technical one. We have had two deals stall in 2026 on exactly this point.

When we pick it

Single-product company, fewer than ten paying tenants, no infra engineer, English-only or Dutch-only corpus under 20,000 pages per tenant, and a buyer base that does not flinch at US data residency. Ship it Monday. Move on.

Option B: Qdrant plus a reranker

Self-hosted Qdrant on a small VM, an open embedding model behind a tiny FastAPI service, and a reranker like bge-reranker-v2-m3 in front of the final top-k. This is the stack that wins on quality for messy, multilingual corpora. It is also the stack that asks the most of the team.

Per-tenant cost at 50,000 pages drops sharply. A single Qdrant node on a €60-a-month box handles a handful of tenants comfortably if you partition by collection. Embedding is a one-time cost per document version. Query-time cost is almost entirely the LLM call itself, because retrieval is local. We have customers running this stack at under €40 per tenant per month all-in, reranker included, on production traffic.

The 17:00 question has a real answer here. Your engineer opens indexer/chunk.py, sees the function that splits the SLA PDF, notices the table-aware branch was never wired in, fixes it, re-embeds the affected document, and the agent gives the right answer by 17:18. That is the dream. The cost of the dream is that someone has to own that file. If your team is three people and none of them want to be on call for Qdrant upgrades, you are buying yourself a second product to maintain.
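What a table-aware branch might look like, as a minimal sketch rather than the splitter from any real project. It assumes the PDF extraction step already yields blank-line-separated blocks, table rows as lines with `|` between cells, and footnotes as blocks starting with `*`. Those conventions and the name `split_table_aware` are hypothetical.

```python
FOOTNOTE_MARK = "*"


def split_table_aware(page: str) -> list[str]:
    """Split a page into chunks, keeping table rows with their header and footnotes."""
    blocks = [block.strip() for block in page.split("\n\n") if block.strip()]
    footnotes = [block for block in blocks if block.startswith(FOOTNOTE_MARK)]
    chunks: list[str] = []
    attached = False

    for block in blocks:
        if block.startswith(FOOTNOTE_MARK):
            continue  # footnotes travel with the rows that cite them
        lines = block.splitlines()
        is_table = len(lines) > 1 and all("|" in line for line in lines)
        if not is_table:
            chunks.append(block)
            continue
        header, *rows = lines
        for row in rows:
            parts = [header, row]
            if FOOTNOTE_MARK in row and footnotes:
                parts.extend(footnotes)  # the row carries a marker, so keep the footnote
                attached = True
            chunks.append("\n".join(parts))

    if footnotes and not attached:
        chunks.extend(footnotes)  # never silently drop text no row referred to
    return chunks
```

Each row becomes its own chunk with the header repeated, so "4 hours*" is never retrieved without the column name or the footnote that qualifies it. A production splitter would need to handle numbered footnotes and bullet lists that also start with `*`.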

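from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

# One Qdrant collection per tenant keeps each customer's corpus isolated.
client = QdrantClient(url="http://qdrant:6333")

tenant_id = "acme"  # placeholder; in production this comes from your tenant table
collection = f"tenant_{tenant_id}"

# create_collection raises if the collection already exists, so check first.
if not client.collection_exists(collection):
    client.create_collection(
        collection_name=collection,
        # size must match the embedding model's output dimension (1024 here).
        vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
    )

# Hybrid: dense vectors in Qdrant, sparse BM25 in a sidecar.
# Rerank the union with a cross-encoder before sending top-5 to the LLM.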
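The last two comments describe a step the snippet does not show. A minimal sketch of it follows, assuming both retrievers return passages as plain strings. `rerank_union` and `token_overlap` are hypothetical names, and `token_overlap` is only a toy stand-in so the example runs without a model; in production a cross-encoder such as the bge-reranker-v2-m3 mentioned above would supply the score.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def token_overlap(query: str, passage: str) -> float:
    """Toy relevance score: number of lowercase tokens shared by query and passage."""
    return float(len(set(query.lower().split()) & set(passage.lower().split())))


def rerank_union(
    query: str,
    dense_hits: list[str],
    sparse_hits: list[str],
    score: Scorer,
    top_k: int = 5,
) -> list[str]:
    """Merge both candidate lists, drop duplicates, and keep the top_k by score."""
    # dict.fromkeys removes duplicates while preserving first-seen order.
    candidates = list(dict.fromkeys([*dense_hits, *sparse_hits]))
    # sorted() is stable, so passages with equal scores keep their original order.
    ranked = sorted(candidates, key=lambda passage: score(query, passage), reverse=True)
    return ranked[:top_k]
```

The point of the union is that each retriever covers the other's blind spot: the dense side finds paraphrases, the sparse side finds exact strings, and the reranker decides which five passages the LLM actually sees.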

When we pick it

Multi-tenant SaaS with more than ten paying customers, a corpus that mixes Dutch and English, a buyer base that demands EU data residency, and at least one engineer who will own the indexer. We have shipped this for two clients in the past year and it is our default when the team has the muscle.

Option C: Postgres pgvector with BM25 hybrid

If you already run Postgres, and you already have one engineer who knows it well, pgvector plus the pg_search BM25 extension is the most boring and most underrated option of the three. You add two extensions to the database you already back up, and your retrieval lives next to the rows it talks about.

The cost story is the cleanest of the three. You are paying for storage and compute on a database you would run anyway. At 50,000 pages per tenant, a modest Postgres instance handles a dozen tenants without breaking a sweat. We have a client running this for €18 per tenant per month, fully loaded, on Hetzner.

The 17:00 question is the easiest of all to answer, because the chunker lives next to the application code and the data lives in the same database as the tenant row. Your engineer writes a SQL query in the staging console, sees the wrong chunk, fixes the splitter, runs a backfill, and goes home. The downside is recall. Pure pgvector loses to a tuned Qdrant + reranker stack on hard queries, and the gap widens on multilingual corpora. The BM25 hybrid closes most of it, but not all.
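The article does not say how the vector and BM25 result lists are combined. One common choice is reciprocal rank fusion, sketched here under that assumption. It takes ranked lists of chunk IDs from each retriever and needs no score calibration between them. The function name and the chunk IDs are illustrative; `k = 60` is the conventional default, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per ID."""
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties break on the ID so the order is deterministic.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


vector_ranking = ["c1", "c2", "c3"]  # nearest neighbours from the vector index
bm25_ranking = ["c3", "c1", "c4"]    # keyword matches from the BM25 index
fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
```

Chunks that appear in both lists rise to the top, and a chunk found by only one retriever still survives, which is how an acronym match from BM25 makes it into the final context.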

Warning

Do not pick pgvector because it is cheap if your team has never written a database migration under pressure. The savings disappear the first time someone forgets to add an index and search latency goes to four seconds.

When we pick it

You already run Postgres in production. You have one engineer who is comfortable in it. Your corpus per tenant is under 100,000 pages. You want one fewer system to back up, monitor, and explain to the next hire.

The scoring sheet we actually use

We score every prospective RAG stack on six lines before we commit. Per-tenant cost at projected scale. Who owns the chunker. Data residency posture. Latency budget for the agent's first token. Cost to swap embedding models in eighteen months. Cost to add the eleventh tenant.

The last one is the trap. Hosted file-search is cheapest at tenant one and most expensive at tenant fifty. Qdrant is the inverse. pgvector is flat across the curve until you hit the recall ceiling. If you cannot say with a straight face how many tenants you will have in eighteen months, score the middle column and bias toward the option that lets you change your mind without a migration.
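One way to put numbers on that curve is to split each option into a fixed monthly cost and a variable cost per tenant. This is a sketch of the reasoning, not our scoring sheet. The function names are hypothetical, and the example inputs are assumptions: EUR 1,500 a month of engineer time to own a self-hosted stack and EUR 250 per tenant as a stand-in for "a few hundred euros" on hosted. Only the EUR 40 per tenant for Qdrant comes from the figures above.

```python
import math
from typing import Optional


def per_tenant_cost(fixed_monthly: float, variable_per_tenant: float, tenants: int) -> float:
    """All-in monthly cost per tenant once the fixed cost is spread across tenants."""
    if tenants < 1:
        raise ValueError("tenants must be at least 1")
    return fixed_monthly / tenants + variable_per_tenant


def break_even_tenants(
    fixed_self_hosted: float,
    hosted_per_tenant: float,
    self_hosted_per_tenant: float,
) -> Optional[int]:
    """Smallest tenant count at which self-hosting costs no more than hosted."""
    saving_per_tenant = hosted_per_tenant - self_hosted_per_tenant
    if saving_per_tenant <= 0:
        return None  # hosted is never more expensive per tenant, so no break-even
    return max(1, math.ceil(fixed_self_hosted / saving_per_tenant))


# Illustrative inputs only; replace them with your own contract and salary figures.
tenants_needed = break_even_tenants(1_500, 250, 40)
```

Run it with your own three numbers before the architecture meeting. If the break-even count is far above the tenants you can defend for the next eighteen months, the fixed cost of owning the stack is the thing you are really buying.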

What we did for one Dutch client

When we built the customer-facing knowledge agent for a Dutch logistics SaaS this spring, two constraints decided it. The corpus was 60% Dutch shipping-regulation PDFs with tables and 40% English API docs, and the buyer was a public-sector procurement officer who would not sign off on US data residency. We shipped Qdrant plus a multilingual reranker on a Hetzner box in Falkenstein, and put the chunker in the same repository as the application, so the engineer who owns the agent also owns the splits.

Before you pick a stack, do this five-minute audit. Open your largest source document. Read the worst-formatted page. Ask yourself who, by name, will fix the chunker when that page gets quoted wrong. If you cannot name the person, you are not choosing an architecture. You are choosing who to blame.

Key takeaway

Pick the RAG stack whose failure mode matches the person on your team who will fix it at 17:00 on a Thursday.

FAQ

Is hosted file-search ever the right call for a serious B2B SaaS?

Yes, when you have fewer than ten tenants, no infra engineer, and a buyer base that does not require EU data residency. It is the right answer for the first eighteen months, then you revisit.

Why include BM25 alongside pgvector instead of just vectors?

Vectors miss exact-string and acronym matches that customers actually type. BM25 catches them. The hybrid recovers most of the recall gap to a dedicated vector database.

How much does the reranker actually add?

On messy multilingual corpora, a cross-encoder reranker over the top-50 typically lifts answer accuracy meaningfully. On clean English-only docs, the gain is smaller and may not justify the latency.

What is the real per-tenant cost difference at 50,000 pages?

In our deployments, per tenant per month: roughly €18 for pgvector, under €40 for self-hosted Qdrant with a reranker, and a few hundred euros for hosted file-search at production query volume. Your traffic shape will move these.

Can we start on hosted file-search and migrate later?

Yes, but plan for it. Keep your source documents in your own storage, version them, and never let the vendor's chunked representation become your source of truth. Migration is then a re-index, not a rescue.
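Keeping the source of truth on your side can be as simple as a manifest of content hashes. A minimal sketch, assuming documents are available as bytes keyed by an ID you control; `build_manifest` and `plan_reindex` are hypothetical names, not part of any vendor SDK.

```python
import hashlib


def build_manifest(documents: dict[str, bytes]) -> dict[str, str]:
    """Map each document ID to the SHA-256 hash of its current content."""
    return {
        doc_id: hashlib.sha256(content).hexdigest()
        for doc_id, content in documents.items()
    }


def plan_reindex(old: dict[str, str], new: dict[str, str]) -> tuple[list[str], list[str]]:
    """Compare two manifests and return (documents to index, documents to delete)."""
    to_index = sorted(doc_id for doc_id in new if old.get(doc_id) != new[doc_id])
    to_delete = sorted(doc_id for doc_id in old if doc_id not in new)
    return to_index, to_delete
```

Store the manifest next to the documents. On a normal day, `plan_reindex` tells you which files changed since the last run. On migration day, comparing against an empty manifest returns every document, which is exactly the re-index the new stack needs.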
