Chunking strategies for RAG: nine tested on 92k Dutch docs

Nine chunking strategies. One 92,000-document Dutch municipality archive. Real test queries from working archivists. Here is what moved recall, and the technique that beat hybrid BM25.

Jacob Molkenboer · Founder · A Brand New Company · 14 Aug 2024 · 9 min

It is a Tuesday morning at a municipal archive in the Randstad. An archivist types a query into a beta search tool we built: "welke moties zijn er aangenomen over windenergie tussen 2018 en 2022?" (which motions on wind energy were adopted between 2018 and 2022?). The system has to read back across 92,000 documents: council resolutions, moties, college letters, zoning plans, most of them PDFs, some scanned with bad OCR, a handful still in DOCX from 2009. The wrong chunk boundary means the right motie sits two paragraphs away from whatever vector the search returns first. Recall is the whole game.

We spent the spring of 2026 benchmarking nine chunking strategies on that corpus. Same embedding model, same reranker, same query set. Only the chunker changed. This is the field guide we wish we had had on day one.

The corpus and the queries

The archive holds 92,184 documents from a single Dutch gemeente going back to 2002. Median length is around 1,400 words; the tail runs past 80,000 words for omgevingsvergunningen with appendices. Roughly 71% are born-digital PDFs, 22% are scanned PDFs (we re-OCRed everything with Surya before indexing), and the rest is a mix of DOCX, ODT, and HTML pulled from the old Joomla site.

Our evaluation set is 612 queries written by three working archivists, each tied to between one and twelve ground-truth documents they had already located by hand. The metric is recall@10 at the document level: did at least one chunk from a target document make the top ten?
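
For readers who want to reproduce the metric, here is a minimal sketch of document-level recall@10 as defined above: a query counts as a hit when at least one of its target documents appears among the parent documents of the top ten chunks. The function name, the dictionary layout, and the toy IDs are assumptions made for illustration, not the evaluation harness used for the benchmark.

```python
from typing import Dict, List, Set


def document_recall_at_k(
    ranked_doc_ids: Dict[str, List[str]],
    relevant_doc_ids: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Share of queries with at least one target document in the top k chunks."""
    hits = 0
    for query_id, targets in relevant_doc_ids.items():
        # Each retrieved chunk is represented by the ID of its parent document.
        top_k = ranked_doc_ids.get(query_id, [])[:k]
        if targets.intersection(top_k):
            hits += 1
    return hits / len(relevant_doc_ids) if relevant_doc_ids else 0.0


# Toy data with made-up IDs: q1 is a hit, q2 is a miss.
ranked = {"q1": ["doc-7", "doc-7", "doc-2"], "q2": ["doc-9", "doc-4"]}
relevant = {"q1": {"doc-2"}, "q2": {"doc-1", "doc-3"}}
score = document_recall_at_k(ranked, relevant, k=10)
```

Note that two chunks from the same document occupy two of the ten slots, which is one reason heavily overlapping chunkers do not automatically win on this metric.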

The embedding model is held constant at jina-embeddings-v3, a multilingual model with an 8,192-token context window. The retriever is a flat HNSW index. The reranker is BGE-reranker-v2-m3 over the top 50. Nothing in that stack changes between runs. Only the chunker moves.

The baseline you have to beat

Most production RAG systems we audit at municipalities and law firms run a hybrid retriever: BM25 over chunks plus dense embedding retrieval, fused with reciprocal rank fusion. That is the bar. On this corpus, hybrid BM25 plus dense (chunks at fixed 512 tokens) lands at recall@10 = 0.78. Anything below that is a regression; anything above it has to justify the operational cost.
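
Reciprocal rank fusion is simple enough to sketch in a few lines. The version below assumes each retriever returns an ordered list of chunk IDs and uses the commonly cited constant k = 60; the article does not state which constant the audited systems use, so treat that value and the chunk IDs as illustrative.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists: each list adds 1 / (k + rank) to a chunk's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


bm25_ranking = ["c3", "c1", "c2"]   # hypothetical BM25 order
dense_ranking = ["c1", "c4", "c3"]  # hypothetical dense order
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Because only ranks are used, the BM25 and cosine scores never need to be put on a common scale, which is a large part of why this fusion is so widely deployed.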

Nine chunkers, head to head

Each chunker was tuned for at least one full day, then locked. Numbers below are document-level recall@10 against the 612 labeled queries. For this table only, retrieval is dense-only, with BM25 fusion and the reranker switched off, so you can see what the chunker itself is doing.

#	Chunker	Notes	Recall@10
1	Fixed 512 tokens, no overlap	The strawman.	0.58
2	Fixed 512 tokens, 64-token overlap	Old default.	0.63
3	Recursive character splitter	LangChain-style, 1000 / 200.	0.66
4	Sentence-aware (Dutch spaCy)	Pack 4 to 6 sentences.	0.61
5	Paragraph-aware	Native paragraph breaks, max 800 tokens.	0.65
6	Sliding window 1024 / 256	Wider window, more redundancy.	0.69
7	Semantic (embedding-drop)	Cosine breakpoint between sentences.	0.71
8	Structure-aware (headings)	Uses besluit / motivering / bijlage outline.	0.74
9	Late chunking	Embed first, chunk second.	0.81

Hybrid BM25 plus dense over strategy #1 hits 0.78. Late chunking on its own, with no BM25, beats it. With BM25 added on top, late chunking gets to 0.86. That is the headline.

Why fixed chunking lost (and why everyone still ships it)

Fixed-size chunking is in every quickstart for a reason: deterministic, fast, trivially parallel. It also throws away every signal the document gave you. A motie in a Dutch council resolution has a predictable shape: Beslispunt, Motivering, Stemverhouding. Chop that into 512-token bricks and the beslispunt ends up severed from its motivering half the time. The embedding is then a soup of two unrelated arguments.

The cheapest fix is also the most underrated: walk the document outline first. Strategy #8 (heading-aware) was four days of work, mostly regex and a markdown converter, and it added 16 points of recall over the fixed baseline. If you cannot ship anything else this quarter, ship that.
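
To give a feel for how little code "walk the document outline" can be, here is a rough sketch of a heading-aware splitter. The heading vocabulary in the regex and the sample motie are assumptions based on the section names mentioned above; a real corpus will need its own list, and this is not the production chunker.

```python
import re
from typing import List, Tuple

# Assumed heading vocabulary; a line counts as a heading only if it stands alone.
HEADING_RE = re.compile(
    r"^(Beslispunt(?:en)?|Motivering|Stemverhouding|Bijlage(?:[ \t]+\d+)?)[ \t]*:?[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_on_headings(text: str) -> List[Tuple[str, str]]:
    """Return (heading, body) pairs; text before the first heading is 'preamble'."""
    sections: List[Tuple[str, str]] = []
    matches = list(HEADING_RE.finditer(text))
    first_start = matches[0].start() if matches else len(text)
    preamble = text[:first_start].strip()
    if preamble:
        sections.append(("preamble", preamble))
    for i, match in enumerate(matches):
        # A section body runs until the next heading or the end of the text.
        body_end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((match.group(1), text[match.end():body_end].strip()))
    return sections


sample = (
    "Motie windenergie 2021\n"
    "Beslispunt\n"
    "Het college onderzoekt locaties voor windturbines.\n"
    "Motivering\n"
    "De gemeente wil in 2030 energieneutraal zijn.\n"
    "Stemverhouding\n"
    "Voor: 21, tegen: 8.\n"
)
sections = split_on_headings(sample)
```

Each pair can then become one chunk, with the heading kept as metadata or prepended to the body before embedding.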

Late chunking, in one paragraph

Late chunking flips the usual order. Instead of slicing the document into chunks and then embedding each one in isolation, you embed the whole document once under full attention, then pool token-level embeddings into chunks afterwards. Every chunk's vector now carries context from the full document. The Stemverhouding chunk at the bottom of a motie knows it belongs to that motie. The technique was formalised by Jina AI in Günther et al., 2024, and it only works with long-context embedding models. Below 8k context, it is not interesting.

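from transformers import AutoModel, AutoTokenizer
import torch

MODEL_ID = "jinaai/jina-embeddings-v3"
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)
tok = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model.eval()

# 1. Tokenise the whole document once.
# Note: truncation=True silently drops everything past 8,192 tokens.
with open("besluit-2022-431.txt", encoding="utf-8") as f:
    doc = f.read()
inputs = tok(doc, return_tensors="pt", truncation=True, max_length=8192)

# 2. Token embeddings under full-document attention.
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state[0]  # (num_tokens, hidden_dim)

# 3. Pick chunk boundaries however you like (headings work well).
# pick_boundaries_from_outline is a placeholder helper, not defined here;
# it must return token-index (start, end) pairs.
boundaries = pick_boundaries_from_outline(doc, tok)  # [(start, end), ...]

# 4. Mean-pool token vectors inside each chunk.
chunk_vectors = [hidden[s:e].mean(dim=0) for s, e in boundaries]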

That is the whole trick. The boundaries can come from any of the other eight strategies. We used heading-aware boundaries inside the late-chunking pipeline, which is why strategy #9 reads as a superset of strategy #8.
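
The one fiddly part is that chunkers usually produce character spans, while the pooling step needs token indices. The snippet above leaves pick_boundaries_from_outline undefined, so here is one possible way its core could look, assuming you have per-token character offsets. The function name, the rule that a token belongs to the chunk containing its first character, and the hand-written offsets are all assumptions for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int]


def char_spans_to_token_spans(offsets: List[Span], char_spans: List[Span]) -> List[Span]:
    """Map character-level chunk spans to half-open token index ranges.

    offsets[i] is the (start, end) character span of token i. Special tokens
    are assumed to carry an empty span such as (0, 0) and are ignored.
    """
    token_spans: List[Span] = []
    for chunk_start, chunk_end in char_spans:
        inside = [
            i
            for i, (tok_start, tok_end) in enumerate(offsets)
            if tok_end > tok_start and chunk_start <= tok_start < chunk_end
        ]
        if inside:  # chunks whose tokens were truncated away are skipped
            token_spans.append((inside[0], inside[-1] + 1))
    return token_spans


# Hand-written offsets for "Beslispunt een. Motivering twee." with a special
# token at each end; real offsets would come from the tokenizer.
offsets = [(0, 0), (0, 10), (11, 14), (14, 15), (16, 26), (27, 31), (31, 32), (0, 0)]
char_spans = [(0, 15), (16, 32)]
token_spans = char_spans_to_token_spans(offsets, char_spans)
```

With a Hugging Face fast tokenizer, the offsets can be requested by passing return_offsets_mapping=True when tokenising; whether a given model's tokenizer supports that is worth checking before relying on it.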

Warning

Late chunking is not free. Embedding a 60,000-token bestemmingsplan in one pass costs roughly 14x more GPU-seconds than embedding it as 120 isolated 512-token chunks. Budget for it, or batch your long documents overnight.

Where the other contenders broke

A few results surprised us enough to be worth flagging.

Sentence-aware did worse than paragraph-aware

We expected the Dutch spaCy sentence splitter to outperform paragraph splitting. It did not. Council resolutions average 38 words per sentence, and paragraphs rarely have fewer than three sentences; packing six sentences into a chunk gave us awkward boundaries where two thirds of a motie sat in one chunk and the dispositive line (draagt het college op) in the next.

Semantic chunking was inconsistent

Embedding-drop chunking (#7) looks great on a clean policy memo and falls apart on a scanned 2009 PDF with OCR noise. The cosine "boundaries" landed inside sentences, around OCR garbage tokens. We retried with a cleaning pass and recovered most of the gap, but never beat structure-aware.
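
For reference, the core of embedding-drop chunking fits in a few lines: embed each sentence, compare neighbours, and start a new chunk wherever similarity falls below a threshold. The sketch below uses toy two-dimensional vectors and a threshold of 0.5, both of which are assumptions; it is not the tuned configuration from the benchmark.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def breakpoints(sentence_vectors: List[Sequence[float]], threshold: float = 0.5) -> List[int]:
    """Indices i where a new chunk starts because sentence i drifts from sentence i-1."""
    return [
        i
        for i in range(1, len(sentence_vectors))
        if cosine(sentence_vectors[i - 1], sentence_vectors[i]) < threshold
    ]


# Toy sentence embeddings: the topic shifts between the second and third sentence.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
starts = breakpoints(vectors)
```

The weakness described above follows directly from this design: a sentence full of OCR garbage produces an outlier vector, so the similarity drops on both sides of it and the chunker cuts around noise rather than around meaning.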

Sliding windows bought recall at the cost of latency

The sliding window at 1024 / 256 helped recall (0.69) but tripled the index size, and reranking the top-50 became the new bottleneck. On a static archive this matters less; on a live corpus that grows by 1,200 documents a month, it is the difference between a €180 / mo and a €640 / mo Postgres bill.
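
A sliding-window chunker is a short loop over token positions. The sketch below reads "1024 / 256" as a 1,024-token window with a 256-token overlap, which is an assumption about the notation; it only generates the spans and does not model vector storage or index overhead.

```python
from typing import List, Tuple


def sliding_windows(n_tokens: int, window: int, overlap: int) -> List[Tuple[int, int]]:
    """Half-open token spans of length `window`, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than the window")
    stride = window - overlap
    spans: List[Tuple[int, int]] = []
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:  # the last window reached the end of the document
            break
        start += stride
    return spans


# A hypothetical 4,000-token document under both settings.
wide = sliding_windows(4000, window=1024, overlap=256)
fixed = sliding_windows(4000, window=512, overlap=0)
```

Setting overlap to 0 reduces this to plain fixed-size chunking, so the same function covers strategies #1, #2 and #6 with different arguments.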

What to try first, in order

If you are currently shipping fixed 512-token chunks with no overlap, here is the order we would hand a small team.

1. Add overlap (64 tokens). One-line change. Buys you about 5 points.
2. Walk the document outline. Headings, numbered sections, frontmatter. About 10 to 12 more points.
3. Add hybrid BM25 if you do not already have it. About 4 points and it covers the cases dense retrieval misses (proper names, dossier numbers).
4. Only then consider late chunking. It needs a long-context embedding model, a GPU budget, and a re-indexing window. The recall gain is real, but it is the last 7 points (0.74 to 0.81 in the dense-only table), not the first.

Takeaway

Document structure beats clever embeddings nine times out of ten. Use the headings the author already gave you before reaching for semantic tricks.

The cost question nobody asked us

One thing the benchmark did not capture: embedding cost shifts at scale. Re-embedding the 92,000-document archive under late chunking with jina-embeddings-v3 ran us roughly €240 of GPU time on a rented A100, against €18 for the fixed-chunk baseline. For the gemeente, that is pocket change against an archivist's hour. For a SaaS pricing model where you ingest user docs on the fly, it is a different conversation. Smaller, quantised models like Google's Gemma 3 QAT release are starting to make on-laptop inference plausible for the embedding side too. We are watching that closely for the next benchmark.

What we shipped

When we built this for the gemeente, the thing we ran into was not the chunker. It was the OCR. About 22% of the archive came in as scans, and the recall difference between Tesseract and Surya was bigger than the difference between any two chunkers on this list. We ended up running Surya twice on anything that looked uncertain and reconciling the outputs. That is the unglamorous half of every RAG system we put into production.

If you only do one thing today, open three documents from your corpus and look at where the headings are. If the structure is there, your chunker should be using it before you read another paper.

FAQ

Why did late chunking beat hybrid BM25 plus dense on this corpus?

Hybrid retrieval helps when chunks are themselves weak. Late chunking makes the chunks stronger, because each vector carries full-document context, so dense alone matches what hybrid was patching.

Do I need jina-embeddings-v3 specifically, or does any embedding model work?

You need a long-context model with bidirectional attention. Below 8k tokens of context, late chunking has nothing to pool from. Jina v3, Nomic v2, and BGE-M3 all qualify; most short-context models do not.

Is heading-aware chunking enough if I cannot afford late chunking?

On structured corpora (legal, policy, technical docs), yes. Heading-aware took us from 0.58 to 0.74 recall@10 against minimal GPU spend. Late chunking adds another 7 points on top.

How big does the corpus need to be before this benchmark applies?

We saw the same ordering at 5,000 documents and at 92,000. The recall numbers shift, but heading-aware always beat fixed-size and late chunking always topped the table.

What about OCR quality?

OCR matters more than chunking strategy on scanned corpora. We saw bigger recall swings between Tesseract and Surya than between any two chunkers on this list. Fix OCR first.
