
Image RAG pipelines: where each layer actually earns out

Image RAG looks simple on a whiteboard. In production, half the layers do real work, half are theatre, and the rerank quietly saves the whole thing.

Jacob Molkenboer · Founder · A Brand New Company · 24 Mar 2024 · 6 min


A document agent for a Dutch insurance broker was quoting policy clauses from the wrong product on roughly one query in nine. The vector index was healthy. The embeddings came from a vision model that had benchmarked well. The wrong page kept ranking in the top three anyway. Two days of audit kept circling back to one question. Which layers of an image-RAG stack actually carry their own weight, and which ones are theatre?

The four-stage shape most image-RAG stacks settle into

After a year of shipping these for clients, almost every pipeline we've built converges on the same four stages. Render, embed, retrieve, rerank. The names change. The shape doesn't.

Render is the part where you turn whatever the source actually is (a PDF, a scanned contract, a Figma export, a screenshot of a Notion table) into a clean image of a page or a region. Embed turns each image into a vector. Retrieve does an approximate nearest-neighbour lookup against a query embedding. Rerank uses a heavier cross-attention model to reorder the top fifty hits before any of them ever reach the LLM.

That last step is where most teams underinvest. It's also the cheapest thing in the stack.
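The four stages can be sketched as two small functions, one for index time and one for query time. This is a minimal sketch, not our production code: `index_pages`, `query_pipeline`, `Hit`, and every callable passed in are hypothetical names, standing in for whatever renderer, embedding model, vector index, and reranker you actually run.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vector = List[float]


@dataclass
class Hit:
    page_id: str
    score: float


def index_pages(
    sources: List[str],
    render: Callable[[str], List[bytes]],    # source -> one image per page
    embed_page: Callable[[bytes], Vector],   # page image -> vector
) -> List[Tuple[str, Vector]]:
    """Index-time half: render each source to page images, then embed each page."""
    index = []
    for source in sources:
        for page_no, image in enumerate(render(source)):
            index.append((f"{source}#p{page_no}", embed_page(image)))
    return index


def query_pipeline(
    query: str,
    embed_query: Callable[[str], Vector],
    retrieve: Callable[[Vector, int], List[Hit]],    # ANN lookup, top-k
    rerank: Callable[[str, List[Hit]], List[Hit]],   # heavier model, reorders
    retrieve_k: int = 50,
    final_k: int = 3,
) -> List[Hit]:
    """Query-time half: embed the query, retrieve top-k, rerank, keep a few."""
    candidates = retrieve(embed_query(query), retrieve_k)
    return rerank(query, candidates)[:final_k]
```

The defaults mirror the numbers used in this article: fifty candidates into the reranker, three results out to the LLM.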

Where embeddings earn out

The quiet news of the last twelve months was ColPali. Instead of embedding a page into a single 768-dimensional vector, you embed it into a grid of patch vectors, then score a query against the grid with a late-interaction operator. The cost is real. Storage goes up by roughly a factor of one hundred. But the recall curves are not a fluke. On document pages with mixed text and figures, you stop missing the page where the answer sits inside a chart and not in the prose around it.

That's the embedding earning out. Not because the math is clever, but because a single vector per page was never enough to represent a page that contains a table, a paragraph, a header logo and three captions at once.
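The late-interaction scoring is easier to see in code than in prose. Here is a minimal sketch of the ColBERT-style MaxSim operator that ColPali builds on: for each query-token vector, take its best-matching patch on the page, then sum those maxima. The toy two-dimensional vectors and the function names are assumptions for illustration; real patch vectors come from the model and have far more dimensions.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: Sequence[Vector], page_patches: Sequence[Vector]) -> float:
    """Late interaction: each query token picks its best patch, scores are summed.

    Assumes page_patches is non-empty.
    """
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)


# Toy example: a two-token query against two pages.
query = [[1.0, 0.0], [0.0, 1.0]]

# Page A keeps one patch per concept (say, a chart patch and a prose patch).
page_a = [[1.0, 0.0], [0.0, 1.0]]

# Page B has been squashed into one averaged vector, as a single-vector
# embedding would do.
page_b = [[0.5, 0.5]]

score_a = maxsim_score(query, page_a)
score_b = maxsim_score(query, page_b)
```

In this toy case the page that keeps its patches separate scores higher, because each query token finds a patch that matches it exactly instead of a diluted average.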

If you don't need page-level retrieval (your corpus is pure photographs, or pure UI screenshots), a single-vector CLIP or SigLIP embedding is still fine. There's no virtue in paying for ColPali storage if your documents have no layout to speak of.

What chunking means once images enter the picture

Chunking is the word that breaks down hardest in image RAG. For text, chunking is about segmenting a long document into retrievable units. For images, it means something almost opposite. You're not splitting an image up. You're choosing the grain at which a page is a unit, or a region is a unit, or a figure plus its caption is a unit.

We've settled on three grain sizes in production:

Page as the unit, for contracts, reports, and policy docs where the answer almost always sits inside one page.
Region as the unit, for scanned multi-column papers and old PDFs where the layout splits the answer across columns.
Figure plus caption as the unit, for technical documentation where the chart and its caption are the answer and the surrounding prose is filler.

The mistake we see most often is teams picking one grain size and forcing the whole corpus through it. The corpus is heterogeneous. The chunking should be too.
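One way to keep the chunking heterogeneous is a plain lookup from document type to grain. This is a sketch under assumptions: the document-type labels and the names `Grain`, `GRAIN_BY_DOC_TYPE`, and `choose_grain` are hypothetical, and how you classify a document in the first place is out of scope here.

```python
from enum import Enum


class Grain(Enum):
    PAGE = "page"
    REGION = "region"
    FIGURE_CAPTION = "figure_caption"


# Hypothetical document-type labels, mapped to the three grain sizes above.
GRAIN_BY_DOC_TYPE = {
    "contract": Grain.PAGE,
    "report": Grain.PAGE,
    "policy": Grain.PAGE,
    "scanned_paper": Grain.REGION,
    "legacy_pdf": Grain.REGION,
    "technical_doc": Grain.FIGURE_CAPTION,
}


def choose_grain(doc_type: str, default: Grain = Grain.PAGE) -> Grain:
    """Pick the chunking grain for a document type, falling back to page-level."""
    return GRAIN_BY_DOC_TYPE.get(doc_type, default)
```

Defaulting to page-level is a design choice, not a rule. It is the cheapest grain to produce, so unknown document types degrade gently instead of failing.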

Warning

Don't do region detection by asking a vision LLM to "find the regions on this page". You'll spend more on inference than on the embedding model. Use a small open-source layout detector trained on PubLayNet or DocLayNet and keep the LLM out of this step.
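Once a layout detector has produced labelled boxes, turning them into figure-plus-caption units is plain geometry and needs no LLM. The sketch below assumes a hypothetical detector output of `Region` objects with lowercase `figure` and `caption` labels and top-left-origin pixel boxes; real detectors use their own label names and formats. It also assumes captions sit below their figures, which is common but not universal.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # x0, y0, x1, y1, origin at top-left


@dataclass
class Region:
    label: str
    bbox: BBox


def union(a: BBox, b: BBox) -> BBox:
    """Smallest box that contains both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def horizontal_overlap(a: BBox, b: BBox) -> float:
    return max(0.0, min(a[2], b[2]) - max(a[0], b[0]))


def figure_caption_units(regions: List[Region], max_gap: float = 40.0) -> List[BBox]:
    """Merge each figure with the nearest caption directly below it.

    A caption qualifies if it starts within max_gap pixels below the figure
    and overlaps it horizontally. Figures with no such caption are kept alone.
    """
    figures = [r for r in regions if r.label == "figure"]
    captions = [r for r in regions if r.label == "caption"]
    units = []
    for fig in figures:
        best: Optional[Region] = None
        best_gap = max_gap
        for cap in captions:
            gap = cap.bbox[1] - fig.bbox[3]
            if 0 <= gap <= best_gap and horizontal_overlap(fig.bbox, cap.bbox) > 0:
                best, best_gap = cap, gap
        units.append(union(fig.bbox, best.bbox) if best else fig.bbox)
    return units
```

Each returned box is a crop you can cut from the rendered page and embed as one unit.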

The rerank as cheap insurance

Rerank is the step that gets cut first when the bill arrives, and the step we'd cut last. A small cross-encoder over the top fifty hits costs almost nothing. The lift is the difference between "the right page is in position seven" and "the right page is in position one". When you're feeding three results into an LLM context window, position seven is invisible.

Cohere, Voyage, and Jina all publish vision-aware rerankers in the same price range, on the order of a fraction of a cent per query at top-50. If your retrieve step is decent and your rerank step is missing, your pipeline is silently dropping correct answers because the right document never made the top three.
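The rerank step itself is small. Here is a minimal sketch of the reorder-and-truncate logic, with the model call abstracted away: `score` is a hypothetical stand-in for whichever cross-encoder or hosted reranker you use, and is not any vendor's actual API.

```python
from typing import Callable, List, Tuple


def rerank_top_k(
    query: str,
    candidates: List[str],
    score: Callable[[str, str], float],  # (query, candidate) -> relevance
    keep: int = 3,
) -> List[Tuple[str, float]]:
    """Score every candidate against the query, sort descending, keep the best few."""
    scored = [(candidate, score(query, candidate)) for candidate in candidates]
    # Python's sort is stable, so ties keep their original retrieval order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:keep]
```

With fifty candidates in and three out, a page that retrieval left in position seven only reaches the LLM if `score` ranks it above its neighbours. Without this step it never gets the chance.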

What we'd cut first if the budget shrank

Here's the order we'd cut things in.

1. The OCR step that runs in parallel with the image embedding. If your embedding model is ColPali or one of its successors, it's already reading text from the rendered page. The separate OCR pass is a belt-and-braces step that doubles your indexing cost for a marginal recall bump.
2. The metadata-extraction LLM pass. We've watched teams spend several hundred euros per ten thousand pages asking an LLM to extract "topic, author, date" from each page. The retriever doesn't need it. The user's query is already doing that filtering at query time.
3. The "rewrite the query into three queries" preprocessing step. Sometimes useful. More often a way to spend tokens for no measurable recall gain.

Don't cut rerank. Don't cut layout-aware chunking. Don't cut the render DPI.

The pattern is consistent across every pipeline we audit. The parts of the stack that look like they earn their keep (OCR, metadata extraction, query rewriting) often don't. The parts that look like overhead (rerank, render quality, layout detection) almost always do.

The audit you can run today

Sample twenty queries from your production logs. For each one, look at the top ten retrieved chunks before rerank, and the top three after. Count how often the correct chunk was in positions four through ten before rerank but in positions one through three after. That count, as a share of the twenty queries, is your rerank earning its keep. If it's above twenty percent, rerank is the cheapest win you have in the whole pipeline.
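The counting can be scripted in a few lines once you have labelled the sample by hand. A minimal sketch follows. `AuditRow` and `rerank_delta` are hypothetical names, and the ranks are assumed to be 1-based positions of the correct chunk, with `None` meaning it was not in the list at all.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AuditRow:
    rank_before: Optional[int]  # position in the top ten before rerank, or None
    rank_after: Optional[int]   # position in the top three after rerank, or None


def rerank_delta(rows: List[AuditRow]) -> float:
    """Share of queries where rerank lifted the correct chunk from 4-10 into 1-3."""
    if not rows:
        return 0.0
    rescued = sum(
        1
        for row in rows
        if row.rank_before is not None
        and 4 <= row.rank_before <= 10
        and row.rank_after is not None
        and row.rank_after <= 3
    )
    return rescued / len(rows)
```

Queries where the correct chunk was already in the top three, or missing from the top ten altogether, do not count toward the delta. The first group didn't need rerank, and the second group points at a retrieval or chunking problem instead.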

The broker we opened with had skipped rerank entirely and was forcing every document type through the same chunking grain. Two days of adding a vision rerank and choosing the chunking grain per document type cut the wrong-clause rate from eleven percent to under one percent. Most of our RAG and knowledge-base work follows that same shape: one or two missing steps doing most of the damage.

Pull the twenty queries tonight. Note the rerank delta. Let that single number tell you where the next euro of budget belongs.

Key takeaway

In image RAG, the parts that look like overhead (rerank, render quality, layout detection) usually earn their cost. The parts that look like real work (OCR, metadata extraction, query rewriting) often don't.

FAQ

Do I still need image RAG if my model can read PDFs natively?

For one-off questions, no. For repeated queries over a large corpus, retrieval is still cheaper and more auditable. The per-query cost of native vision context scales linearly with corpus size; the cost of a retrieval lookup doesn't.

Is ColPali always better than single-vector CLIP?

No. For pure photo or UI-screenshot corpora without layout, single-vector CLIP or SigLIP is fine and roughly 100x cheaper to store. ColPali earns out on document pages with mixed text and figures.

How much does a rerank pass actually cost per query?

Vision-aware rerankers from Cohere, Voyage and Jina run a fraction of a cent per query at top-50. The cost is negligible compared with embedding storage or LLM inference.

What's the fastest way to tell if my pipeline is missing rerank?

Sample twenty production queries. If the correct chunk often appears in positions four through ten of your raw retrieval, you're losing answers a rerank pass would have saved.
