RAG retrieval on Dutch huurcontracten: three stacks tested

A 21-person housing corporation in Utrecht runs 3,400 tenant searches a week against 280,000 scanned rental contracts. We tested three retrieval stacks on the real workload.

Jacob Molkenboer · Founder · A Brand New Company · 1 Sept 2025 · 7 min

It is a Tuesday morning in a Utrecht housing corporation office. A complaints officer named Eva opens her laptop. A tenant is disputing a mutatie-kosten line on his move-out invoice. The clause he is fighting was signed in 2014. The contract is a scanned PDF that some intern fed through a Konica Minolta a decade ago. The OCR thinks huurder is hourder in three places.

Eva has 3,400 of these searches a week to handle across a team of seven. The corpus is 280,000 huurcontracten, addenda, and tenant correspondence stretching back to 2002. We were asked to build the RAG retrieval layer behind the agent that drafts her first answer.

This post is the comparison sheet we wrote for that project: LlamaIndex, Haystack, and a hand-rolled pgvector plus BM25 hybrid, scored against the actual workload.

The workload, not a benchmark

The temptation in a RAG comparison is to download MS MARCO and call it a day. We did not. The bar was specific to the corporation:

  • 3,400 Dutch-language queries per week, peak 90 per hour on Monday mornings
  • 280,000 documents, average 4 pages, roughly 11 percent with OCR error rates above 5 percent at the character level
  • Hard requirement: cite the exact paragraph, with page number, for any answer that touches money or termination dates
  • 21 staff, one part-time sysadmin, one Postgres database they already run on Hetzner

That last constraint mattered more than the others. Nobody at the corporation wanted to learn a vector-store DSL on top of the seven systems they already babysit.

Stack one: LlamaIndex on Pinecone

LlamaIndex is the fastest path from zero to a working agent. We had the corpus indexed and a first answer flowing in eleven hours of work. The ingestion pipeline handled the PDFs, ran them through a hosted OCR, chunked, embedded with multilingual-e5-large, and stored vectors in a managed Pinecone index. The ingestion pipeline docs are honest about the trade-offs, which we appreciated.

What we liked: the QueryFusionRetriever gave us a hybrid of dense and sparse out of the box. A Dutch-aware splitter respected paragraph breaks better than naive 512-token windows did, which mattered for the boilerplate clauses that recur across thousands of contracts.
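To make the paragraph-versus-window point concrete, here is a minimal sketch of a paragraph-aware splitter. It is a generic stand-in, not LlamaIndex's splitter: the function name `split_paragraphs`, the blank-line paragraph rule, and the character budget (instead of a token budget) are all assumptions made for illustration.

```python
import re
from typing import List


def split_paragraphs(text: str, max_chars: int = 1200) -> List[str]:
    """Pack whole paragraphs into chunks without cutting a paragraph in half.

    Assumption: paragraphs are separated by blank lines. A paragraph longer
    than max_chars is kept whole rather than split mid-clause.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: List[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the chunk, start a new one.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Hypothetical contract text with three short clauses.
sample = "Artikel 1 Huurprijs\n\nArtikel 2 Opzegging\n\nArtikel 3 Mutatiekosten"
chunks = split_paragraphs(sample, max_chars=45)
```

With a fixed-size window a clause can end up straddling two chunks; here a clause always lands in exactly one, which is what makes paragraph-level citation possible later.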

What killed it for production: per-query cost and lock-in. Pinecone at our document count was about 70 euro a month for the index plus another 90 euro per million queries. Add the hosted embedding refresh on document edits and we were looking at roughly 380 euro a month in vendor bills before a single token of generation. The corporation runs lean. That is more than they pay for their accounting software.

There was also the OCR problem. The default settings shipped chunks straight to the embedder. A misread token like hourder is noise to an embedder trained on clean Dutch: the subword tokenizer breaks it into pieces that carry little of the meaning of huurder. Recall on the bad-OCR slice was 0.61 at k=20. Not unusable, but the misses concentrated in the exact clauses that show up in disputes.

Stack two: Haystack on Elasticsearch

Haystack felt like it was written by people who had actually shipped a search system. The component graph in version 2 is clean. We could wire a hybrid retrieval pipeline (BM25 from Elasticsearch, dense from a self-hosted embedder, fused with reciprocal rank fusion) in a single YAML file, and the DocumentStore abstraction let us swap backends without rewriting the pipeline.

The reranker story is also better. Haystack drops a cross-encoder over the top-k from retrieval. On the dispute test set (we built one: 84 real complaints from the last 18 months, with the ground-truth clauses tagged by the corporation's lead jurist) the reranker lifted mean reciprocal rank from 0.58 to 0.74.
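For readers who want to score their own test set the same way, the two metrics quoted in this post can be sketched in a few lines. This is an illustrative implementation under assumptions, not the evaluation harness from the project: the function names and the clause ids (`c7`, `c12`, and so on) are hypothetical.

```python
from typing import Iterable, Sequence, Set, Tuple


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the tagged clauses that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """1 / position of the first relevant result, or 0.0 if none is found."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked results, ground truth) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Two hypothetical complaints: the right clause is second in one, first in the other.
runs = [
    (["c7", "c12", "c3"], {"c12"}),
    (["c1", "c2"], {"c1"}),
]
mrr = mean_reciprocal_rank(runs)  # (1/2 + 1/1) / 2
```

MRR rewards putting the right clause near the top, while recall at k only asks whether it made the list at all, so the two numbers answer different questions about the same ranking.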

What pushed us off Haystack in the end was the operational tax. Elasticsearch wants its own JVM, its own backup story, its own version-upgrade calendar. The corporation's sysadmin is one person, three days a week. Asking him to add Elasticsearch to a stack that already runs Postgres, Redis, an MTA, and an ancient SharePoint felt unkind.

We almost shipped it anyway. Retrieval quality was the best of the three. But every conversation about uptime kept ending with "and who restarts it on Sunday morning?"

Stack three: pgvector plus BM25, hand-rolled

The corporation already had Postgres. They run it well. So we put the corpus in the database they already trust. The BM25 side rides on ParadeDB's pg_search; the vector side rides on pgvector with an HNSW index.

The schema is unremarkable:

create extension if not exists vector;
create extension if not exists pg_search;  -- BM25 inside Postgres (ParadeDB)

create table chunks (
  id            bigserial,
  contract_id   text not null,
  contract_year int not null,    -- partition key, drives retention drops
  page          int not null,
  paragraph     int not null,
  body          text not null,   -- raw OCR text, shown to users
  body_clean    text not null,   -- OCR-corrected variant, indexed and embedded
  embedding     vector(1024),    -- multilingual-e5-large
  tsv           tsvector generated always as (
                  to_tsvector('dutch', body_clean)
                ) stored,
  primary key (id, contract_year)  -- must include the partition key
) partition by range (contract_year);

create index on chunks using hnsw (embedding vector_cosine_ops);
create index on chunks using bm25 (id, body_clean)
  with (key_field='id');

Hybrid retrieval is a CTE that runs the two queries, normalises the scores, and fuses with reciprocal rank fusion. The whole thing lives in one stored function the agent calls.
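The fusion step is easier to see outside SQL. Below is a minimal Python sketch of reciprocal rank fusion, not the stored function itself. The name `rrf_fuse`, the chunk ids, and the constant `k = 60` (a commonly used default) are assumptions; the post does not say which constant the production function uses. RRF only looks at rank positions, so this sketch takes ranked id lists and leaves out any score normalisation.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked id lists with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) for every id it contains, with rank
    starting at 1. Ids missing from a list contribute nothing for that list.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical top-3 lists from the BM25 query and the vector query.
bm25_hits = ["a", "b", "c"]
dense_hits = ["b", "d", "a"]
fused = rrf_fuse([bm25_hits, dense_hits])
```

Chunk `b` wins because both retrievers rank it highly, even though neither BM25 scores nor cosine distances are comparable with each other. That is the main appeal of fusing on ranks rather than raw scores.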

The body_clean column was the trick that broke the recall problem. We ran a one-time pass over the 31,000 chunks that had bad OCR, asking a cheap local model to fix obvious typos against a Dutch dictionary while preserving anything that looked like a name, address, or number. The clean column went to BM25 and to the embedder. The raw column stayed available for display, so tenants and jurists still saw the original wording.
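The shape of that cleaning pass can be sketched without a model. The example below swaps the local model for plain fuzzy matching from Python's standard library (`difflib`), so it is a deliberately cruder technique than the one described above. The function name `clean_ocr`, the tiny lexicon, the 0.8 cutoff, and the "capitalised means name" rule are all assumptions chosen for illustration.

```python
import difflib
import re
from typing import Iterable

# Alternating runs of word characters and non-word characters, so that
# joining the tokens back together reproduces the input exactly.
TOKEN = re.compile(r"\w+|\W+")


def clean_ocr(text: str, lexicon: Iterable[str], cutoff: float = 0.8) -> str:
    """Replace likely OCR typos with the closest lexicon word.

    Left untouched: whitespace and punctuation, anything containing digits,
    capitalised tokens (a crude stand-in for names and addresses), and
    words already in the lexicon.
    """
    words = sorted({w.lower() for w in lexicon})
    known = set(words)
    out = []
    for token in TOKEN.findall(text):
        protected = (
            not token.isalpha()
            or token[0].isupper()
            or token.lower() in known
        )
        if protected:
            out.append(token)
            continue
        match = difflib.get_close_matches(token.lower(), words, n=1, cutoff=cutoff)
        out.append(match[0] if match else token)
    return "".join(out)


# Hypothetical mini-lexicon; a real pass would load a full Dutch word list.
lexicon = ["de", "huurder", "betaalt", "euro", "aan"]
raw = "De hourder betaalt 450 euro aan Jansen"
clean = clean_ocr(raw, lexicon)
```

The amount and the surname pass through unchanged, and only the misread noun is corrected. The capitalisation rule means a misread word at the start of a sentence is skipped, which is one reason a model-based pass is likely to do better than this sketch.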

Recall on the dispute set: 0.79 at k=20 for hybrid, 0.86 after a cross-encoder rerank. That beat Haystack's pre-rerank number and matched its post-rerank number. The reranker is bge-reranker-v2-m3, running on a single old GPU the corporation had sitting in a drawer.

Takeaway

Recall problems on OCR-heavy corpora are usually upstream of the retriever. Clean the text once, store both versions, and most of the embedding-model debate goes away.

Per-query cost at 3,400 weekly searches

Costs after three months in production, all-in:

  • LlamaIndex on Pinecone plus hosted embeddings: 0.029 euro per query
  • Haystack on self-hosted Elasticsearch plus self-hosted embedder: 0.011 euro per query, plus roughly 4 hours a month of sysadmin time
  • pgvector plus BM25 on the existing Postgres: 0.003 euro per query, zero new infrastructure

The pgvector number is electricity and the marginal CPU cost of queries on a database that was already running. It is not free. It is the closest thing to free that you can ship.
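Per-query figures are hard to compare with a monthly invoice, so here is the conversion as a small sketch. The per-query costs and the weekly volume come from the list above; the 52/12 weeks-per-month factor and the names in the code are assumptions.

```python
WEEKLY_QUERIES = 3_400
WEEKS_PER_MONTH = 52 / 12  # assumption: average month length

# Per-query costs in euro, as listed above.
PER_QUERY_EUR = {
    "llamaindex_pinecone": 0.029,
    "haystack_elasticsearch": 0.011,  # excludes the ~4 hours of sysadmin time
    "pgvector_bm25": 0.003,
}


def monthly_cost_eur(per_query_eur: float, weekly_queries: int = WEEKLY_QUERIES) -> float:
    """Convert a per-query cost into an approximate monthly cost."""
    return per_query_eur * weekly_queries * WEEKS_PER_MONTH


monthly = {name: round(monthly_cost_eur(cost)) for name, cost in PER_QUERY_EUR.items()}
# -> {'llamaindex_pinecone': 427, 'haystack_elasticsearch': 162, 'pgvector_bm25': 44}
```

At roughly 14,700 queries a month the gap between the cheapest and the most expensive option is a few hundred euro, small in absolute terms but large next to the rest of a lean software budget.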

Who reranks when a tenant disputes a 2014 clause

This was the most interesting question we asked, and the one most comparison posts skip.

When Eva's agent retrieves the wrong clause for a mutatie-kosten dispute, somebody has to notice. In the LlamaIndex setup the reranker was a black box: a hosted cross-encoder behind an API. We could log scores but not inspect why a paragraph ranked where it did. When the jurist asked "why did it surface paragraph 7 over paragraph 12?" we could only shrug.

Haystack and pgvector both let us run a local reranker we could read. We dumped the top-50 with their dense scores, BM25 scores, fused scores, and rerank scores into a small admin page. The jurist used it once a week to spot-check the disputes that escalated. After six weeks she found a pattern: contracts from a specific 2013 to 2015 window had a templating quirk that pushed the mutatie clause to page 4 instead of page 2. We adjusted the chunker to handle that template. Recall on that slice went from 0.71 to 0.93.

You only get that feedback loop if the humans can see the scores. That argued, in retrospect, for the boring stack.

The Postgres delete that almost bit us

Mid-project, we made a partitioning decision we had been postponing. Housing corporations purge contracts seven years past tenancy end under retention policy. Running a DELETE across an HNSW index of 280,000 chunks would have meant a long, blocking, painful operation every quarter. PostgreSQL's partitioning docs spell out the alternative: removing a whole partition (DROP TABLE on the partition, or ALTER TABLE ... DETACH PARTITION followed by DROP TABLE) is far faster than a bulk DELETE, which has to update every index on the table and leaves vacuum work behind.

Warning

If your RAG corpus has a retention policy, partition on the retention key from day one. Dropping a partition is instant. Deleting from an indexed table is not.

So chunks are partitioned by contract year. When a year's contracts pass their retention date, we drop that year's partition. The indexes on the remaining partitions are untouched, so nothing needs rebuilding. Boring infrastructure choices compound.

What we shipped

The corporation went live in March on the pgvector plus BM25 stack. Eva's team handles dispute correspondence in roughly 40 percent of the time it used to take. The agent drafts; she edits. Citations are exact, with page and paragraph numbers. The sysadmin did not have to learn a new database.

When we built the RAG agents for this corporation, the thing we kept running into was that the retrieval-quality problem sat downstream of the OCR. We solved it by storing a cleaned text column next to the original, embedding the cleaned one, and showing the original to the tenant.

If you are scoping a RAG project this quarter, the five-minute audit is this: open the worst-OCRed document in your corpus, pick a sentence from it, and type that sentence into your retriever the way a human would read it off the page. If the document does not come back, no vector database is going to fix that for you.

Key takeaway

On OCR-heavy corpora, cleaning the text upstream and storing it next to the original beats every retriever choice. Then pick the stack your sysadmin can already run.

FAQ

Should I start with LlamaIndex or pgvector for a new RAG project?

Start with LlamaIndex if you need a working demo this week. Move to pgvector once you understand your real query volume and your team's operational capacity.

How much does OCR quality matter for retrieval recall?

A lot. On our test set, cleaning OCR errors upstream of the embedder lifted recall by roughly 18 points on the bad-OCR slice. That is more than any retriever choice gave us.

Why not use a managed vector database?

At 3,400 queries a week the vendor bill was about 380 euro a month before generation. The existing Postgres handled the same workload for the electricity cost of a few HNSW queries.

Does Haystack still make sense in 2026?

Yes, if you already run Elasticsearch comfortably. The retrieval quality and rerank story are excellent. The operational tax was the dealbreaker for this client.

What reranker did you ship to production?

bge-reranker-v2-m3, running on an old GPU the corporation already owned. It lifted recall@20 from 0.79 to 0.86 on the dispute test set.
