Legal RAG for a notary office: three approaches compared

A clerk took 41 minutes to find a 2003 prenuptial clause. We built three RAG retrievers to answer questions like hers in seconds. The simplest one almost won.

Jacob Molkenboer · Founder · A Brand New Company · 12 Jun 2024 · 9 min

A senior clerk at a notariskantoor in the Randstad had a question on a Tuesday afternoon. She needed the exact clause language used in the 2003 prenuptial agreement template that the firm had quietly retired in 2008. The archive held roughly 180,000 documents going back to 1971. She had a name, an approximate year, and a memory of one phrase.

She found the document in 41 minutes by hand.

That clerk was the reason we got the call. The office, a 40-person practice in Utrecht, wanted to know whether a RAG system could answer questions like hers in seconds rather than minutes. They had the archive already indexed in a basic search appliance. They wanted something smarter on top.

We built three candidate retrievers against the same eval set. The one we expected to win came second. This is what we learned.

The corpus and the eval set

The archive was a mixed bag. Notarial deeds (akten), executor reports, marital contracts, articles of association, mortgages, and decades of correspondence. Scanned PDFs from the 1970s with OCR errors. Native Word exports from the 2000s. Templated documents where 90% of the text was boilerplate and 10% was the part anyone actually wanted to find.

The clerks gave us 84 real questions they had asked the archive in the previous six months, each one paired with the document it should have returned. Real questions, not synthetic. About a third were exact-phrase lookups ("find the deed that includes the clause about the 1999 garden boundary dispute"). A third were entity questions ("which deeds mention the Janssen family between 1985 and 1995"). A third were conceptual ("show me how we used to phrase the executor's duties before 2010").

That eval set is the only reason we made the right decision. Without it we would have shipped what looked clever in a demo.
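The metric used for every retriever in this post is top-5 recall: the fraction of questions whose expected document appears among the first five results. A minimal sketch, assuming one expected document per question (as in the eval set above) and a hypothetical `search` callable that returns ranked document ids; the canned rankings are illustrative, not project data.

```python
def recall_at_k(eval_set, search, k=5):
    """Fraction of (question, expected_doc_id) pairs where the expected
    document shows up in the first k results returned by `search`."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    hits = sum(
        1 for question, expected in eval_set
        if expected in search(question)[:k]
    )
    return hits / len(eval_set)


# Toy stand-in for a retriever: a dict of canned rankings (illustrative only).
canned = {
    "q1": ["d7", "d2", "d9"],
    "q2": ["d4", "d1", "d3", "d8", "d5", "d6"],
}
eval_set = [("q1", "d2"), ("q2", "d6")]
# d2 is inside the top 5, d6 is ranked sixth, so one of two questions is a hit.
print(recall_at_k(eval_set, canned.get))  # 0.5
```

Because the function only needs a callable, the same eval set can be pointed at each candidate backend without changing the scoring code.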

The BM25 baseline we almost shipped

The first system was the simplest. We chunked every document into 400-token windows with a 50-token overlap, indexed them in Elasticsearch with a Dutch analyzer (Snowball stemmer, custom stoplist that kept legal terms like "ten behoeve van" intact), and threw the question at it.
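The windowing itself is a few lines. A minimal sketch, assuming the document has already been split into a list of tokens (how the tokens were counted is not stated above, so the tokenizer is left out); `window_chunks` is an illustrative name, not the production function.

```python
def window_chunks(tokens, size=400, overlap=50):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # with the defaults, a new window starts every 350 tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# A 1,000-token document yields windows starting at tokens 0, 350 and 700.
tokens = [str(i) for i in range(1000)]
chunks = window_chunks(tokens)
print([len(c) for c in chunks])  # [400, 400, 300]
```

The overlap means a phrase that straddles a window boundary still appears whole in at least one chunk, as long as it is shorter than the overlap.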

BM25 is the algorithm Robertson and colleagues described in the 1990s, and thirty years on, it still has no business being as good as it is. The Okapi BM25 reference is worth a re-read if you have not seen it in a while. It is one weighted formula. No GPU. No vector database. No embedding model.
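For readers who want that formula in front of them, here is a minimal sketch of BM25 scoring over pre-tokenized documents. It is an illustration, not what Elasticsearch runs internally: the `k1=1.2` and `b=0.75` values are the commonly used defaults rather than settings reported for this project, the IDF is the +1-smoothed variant, and the toy documents are invented.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in the non-empty list `docs`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        # Longer-than-average documents are penalised, shorter ones boosted.
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    ["erfgrens", "tuin", "geschil"],
    ["hypotheek", "akte"],
    ["tuin", "akte", "akte"],
]
scores = bm25_scores(["tuin", "geschil"], docs)
# The first document matches both query terms, the third only one, the second none.
```

Rare terms get a high IDF weight and repeated terms saturate rather than growing linearly, which is much of why exact-phrase lookups do well.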

On our eval set it scored 71% top-5 recall.

That is not a typo. The simplest possible retriever, with a careful analyzer and good chunking, found the right document in the top five results seven times out of ten. The exact-phrase queries (a third of the set) ran at 96% recall, because BM25 was made for exactly that shape of query. Where it cratered was the conceptual third, where it managed 38%. The entity questions came in around 78%.

We could have shipped this. The clerks would have been happy. We almost did.

Hybrid search and the reranker tax

The hybrid approach was the one we walked in expecting to win. Embed every chunk with a multilingual model, store the vectors in pgvector next to the BM25 index, retrieve from both, fuse the results with reciprocal rank fusion (RRF), then optionally rerank the top 50 with a cross-encoder.

For embeddings we tested three models: a Dutch-specific fine-tune of a sentence transformer, a large multilingual model, and a generic English model as a control. The Dutch fine-tune won on conceptual queries by a comfortable margin. The English model performed surprisingly close on entity queries (proper nouns travel across languages), and badly on conceptual ones.

The fusion step was the interesting part. RRF is the boring trick that beats most clever combinations. You take the rank of each document in each list, compute 1 / (60 + rank) for each, sum across lists, and sort. Sixty is the magic number from the original paper. We tried weighted variants. The vanilla version won.

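def rrf(rankings, k=60):
    """Fuse ranked lists of doc ids with reciprocal rank fusion."""
    scores = {}
    for ranked_list in rankings:
        # Ranks are 1-based, so the top hit in a list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranked_list, start=1):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)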

With BM25 plus dense retrieval plus RRF, recall went to 84%. With a cross-encoder reranker on top of the fused list, it went to 91%.

The reranker tax was real. Each query now needed an embedding call, a vector search, a BM25 search, the fusion math, and 50 cross-encoder scores. Cold-path latency went from 80ms to 1.4 seconds. The clerks did not love that. We did not love the operating cost either, because the firm wanted everything self-hosted for confidentiality reasons.

Warning

If your eval set is small (under 200 queries), do not let a 4-point recall difference pick your architecture. Bootstrap a 95% confidence interval before celebrating.
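One way to run that check is a paired percentile bootstrap over per-query hit/miss outcomes. A minimal sketch using only the standard library; the function name is illustrative and the outcome lists in the usage lines are made-up placeholders, not the project's per-query results.

```python
import random


def bootstrap_diff_ci(hits_a, hits_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for recall(A) - recall(B) on the same queries.

    hits_a and hits_b are per-query 0/1 outcomes in the same query order.
    """
    if len(hits_a) != len(hits_b) or not hits_a:
        raise ValueError("need two equal-length, non-empty outcome lists")
    rng = random.Random(seed)
    n = len(hits_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample queries with replacement, keeping the A/B pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(hits_a[i] - hits_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Placeholder outcomes for 84 queries: A hits 77, B hits 75, mostly the same ones.
hits_a = [1] * 77 + [0] * 7
hits_b = [1] * 73 + [0] * 4 + [1] * 2 + [0] * 5
low, high = bootstrap_diff_ci(hits_a, hits_b)
print(f"95% CI for the recall difference: [{low:+.3f}, {high:+.3f}]")
```

If the interval straddles zero, the eval set cannot tell the two systems apart, and the decision should rest on something else (latency, cost, user ratings).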

Parent-child chunking and the context problem

The third candidate tackled a different problem. Notarial deeds have structure. A typical akte has a preamble, the parties, the considerations, the operative clauses, and the signatures. When a clerk asks a conceptual question, the right answer might live in a specific clause two pages into a 14-page document, but the answer only makes sense when you read it next to the surrounding clauses.

Small chunks retrieve better. Large chunks reason better. Parent-child chunking is the technique that resolves the tension. You index small chunks (say 200 tokens) for retrieval, and store a pointer back to a larger parent chunk (say 1,500 tokens). At query time you retrieve the small chunk and return the parent.
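The pattern fits in a few lines. This is an illustration under stated assumptions, not the production linkage: `search_children` is a hypothetical callable that returns ranked child ids, and the two dicts stand in for whatever store holds the chunks.

```python
def retrieve_parents(query, search_children, child_to_parent, parents, k=5):
    """Search over small child chunks, then return their larger parents.

    Several children can point at the same parent, so parents are
    de-duplicated while keeping the order of the best-ranked child.
    """
    seen = set()
    results = []
    for child_id in search_children(query):
        parent_id = child_to_parent[child_id]
        if parent_id in seen:
            continue
        seen.add(parent_id)
        results.append(parents[parent_id])
        if len(results) == k:
            break
    return results


# Toy data: two children of the same parent rank first, a second parent follows.
parents = {"p1": "full text of parent one", "p2": "full text of parent two"}
child_to_parent = {"c1": "p1", "c2": "p1", "c3": "p2"}
ranked_children = lambda query: ["c2", "c1", "c3"]
print(retrieve_parents("executor duties", ranked_children, child_to_parent, parents))
```

De-duplicating matters in practice: without it, one long parent whose children all match can fill the entire result list.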

LangChain calls this the parent document retriever, and the documentation is a reasonable starting point if you have not seen the pattern before. We did not use LangChain in production. We wrote the linkage in roughly 80 lines of Python around the same Elasticsearch and pgvector backends.

The implementation that mattered was the parent boundary. Naive parent-child uses fixed-size parent windows. That works for prose. It breaks for legal documents where the meaningful unit is the clause or the article, and where clauses can run from one paragraph to two pages.

We wrote a chunker that walked the document tree (built from the layout hints in the PDF plus regex for clause numbering: "Artikel 1.", "1.1", "Onder a."), and made each parent a complete clause. Small children inside each parent were 200-token windows aligned to sentence boundaries. The retriever returned the parent clause, not the child window.

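import re

# Clause markers at the start of a line: "Artikel 1.", "1.1", "Onder a." or "Onder a)".
CLAUSE_RE = re.compile(
    r"^\s*(Artikel\s+\d+|\d+\.\d+|Onder\s+[a-z][.)])",
    re.MULTILINE,
)

def split_into_clauses(text):
    # Start offsets of every clause marker, plus the end of the document.
    boundaries = [m.start() for m in CLAUSE_RE.finditer(text)]
    # Keep any preamble that precedes the first marker.
    if not boundaries or boundaries[0] != 0:
        boundaries.insert(0, 0)
    boundaries.append(len(text))
    chunks = (text[a:b].strip() for a, b in zip(boundaries, boundaries[1:]))
    return [c for c in chunks if c]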
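The child windows inside each clause can be sketched along the same lines. This is a rough illustration, assuming a naive regex sentence splitter and whitespace token counts; a production splitter would need to handle abbreviations and clause numbers such as "Artikel 1." more carefully, and `child_windows` is an illustrative name.

```python
import re

# Naive sentence boundary: whitespace that follows closing punctuation.
SENTENCE_END = re.compile(r"(?<=[.;:?!])\s+")


def child_windows(clause_text, max_tokens=200):
    """Pack whole sentences into windows of at most max_tokens tokens.

    A single sentence longer than max_tokens becomes its own window.
    """
    windows, current, count = [], [], 0
    for sentence in SENTENCE_END.split(clause_text.strip()):
        n = len(sentence.split())
        if n == 0:
            continue
        if current and count + n > max_tokens:
            # The next sentence would overflow, so close the current window.
            windows.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        windows.append(" ".join(current))
    return windows


text = "Een twee drie. Vier vijf. Zes zeven acht negen."
print(child_windows(text, max_tokens=5))
# ['Een twee drie. Vier vijf.', 'Zes zeven acht negen.']
```

Each window would then be indexed with a pointer back to the clause it came from, so retrieval can return the whole clause.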

On the eval set, parent-child with hybrid retrieval underneath scored 89% top-5 recall. Slightly below the reranker setup. The clerks scored it higher on usefulness, because the returned context was a complete clause they could read and understand rather than a fragment that needed assembly.

Latency was 180ms, with no cross-encoder in the query path.

The decision and what shipped

We ran the three candidates blind against the clerks for two weeks. Same UI, randomised which backend served each query. The clerks rated each response on a four-point scale. The hybrid-plus-reranker system won on recall by 2 percentage points. The parent-child system won on clerk satisfaction by 18 points.

We shipped parent-child.

The numbers from production, six months in:

Median query latency: 210ms.
Top-5 recall on a refreshed 200-question eval: 88%.
Average time-to-answer for the kind of question that opened this post: 14 seconds, down from a self-reported average of 8 minutes.
Clerks who asked the system at least one question last week: 34 out of 38.

The boring BM25 baseline still runs in production, by the way. Two reasons. First, for exact-phrase legal queries (statute references, clause titles, named contracts) it is faster and more accurate than any neural retriever we tested. Second, it is the canary. When the embedding pipeline misbehaves, BM25 keeps the office working.

Takeaway

A small evaluation set of real queries, built before you choose an architecture, is worth more than any benchmark.

Three things that did not matter

Worth listing the experiments that failed, because they take up most of the actual work.

Multi-vector embeddings (ColBERT-style) gave small recall gains on Dutch legal text and a large storage overhead. ColBERT shines on academic IR benchmarks. On our corpus it was a wash.

Query rewriting with a small LLM in front of the retriever. The hope was that an LLM could expand "the executor clauses from before 2010" into a richer query. In practice the rewrites introduced as many false leads as they fixed. We pulled it out after a week.

Chunking by sentence. Sounds clean. Destroyed conceptual recall because clauses lost their context. The 200-token child window aligned to sentence boundaries was the right unit.

What this looks like for your archive

If you have a document corpus and you are weighing RAG architectures, the order of operations matters more than the algorithm choice.

Build the eval set first. Eighty real queries from the people who will use the system. Pair each one with the document that should win. Without this you are guessing.

Then run BM25 with a careful analyzer. That is your floor. Most systems start here and never need to leave. If BM25 hits 80% on your eval set, the question is whether the remaining 20% is worth the operational complexity of anything more sophisticated.

If you need more, parent-child with hybrid retrieval is the next step. The cross-encoder reranker is the step after that, and only if your users can wait the extra second.

When we built the notary archive retriever, the thing we did not expect was how much of the work was in the chunker, not the model. The parent boundary logic was four times the code of the retrieval pipeline. That is the work that ages well, because it is rooted in the structure of the documents themselves rather than the embedding model of the month. When we build AI agents and RAG systems for clients now, the chunker is the first thing we design, not the last.

If you are starting out today, write 30 of your real questions on paper before you write any code. That alone will change the system you build.

FAQ

Should I start with BM25 or jump straight to hybrid search?

Start with BM25 and a careful language analyzer. It is the floor every other approach has to clear. If it scores well on your eval set, the operational cost of dense retrieval may not be worth it.

How big should my evaluation set be?

Eighty real queries from the people who will use the system is enough to make a directional decision. Two hundred is enough to trust a small percentage-point difference. Anything under fifty is theatre.

When is a cross-encoder reranker worth the latency cost?

When your users will wait the extra second and you have already exhausted the gains from chunking. On this project the reranker lifted top-5 recall by seven points (84% to 91%), but cold-path latency rose to 1.4 seconds and per-query cost rose with it.

Does parent-child chunking need a custom chunker?

For prose, no. For structured documents like legal deeds, statutes, or contracts, yes. The parent boundary should follow the document's own structure (clauses, articles, sections), not a fixed token window.
