Image RAG for floorplans: sketch search across 40k PDFs
An architecture studio with fourteen years of work sitting in 40,000 floorplan PDFs asked us a small question: could they search the archive by drawing?

On a Tuesday morning in February, the principal at a Rotterdam architecture studio opened a brief from a client who had walked out of her office an hour earlier. The client wanted a wedge-shaped house with a courtyard cut into the south corner. She remembered drawing something close to that in 2018. She did not remember the project name, the client, or the year. Her Mac's search box gave her file names. Her DMS gave her tags. Neither knew what a wedge with a courtyard looked like. She had 40,000 floorplan PDFs sitting on a NAS, fourteen years of accumulated work, and no way to ask the archive "show me everything that looks like this."
This is the story of what we built for her, and the three or four times it failed in interesting ways before it stopped failing.
What the studio actually asked for
The brief was three sentences. "We have 40,000 PDFs. We want to search them by drawing on a canvas. Can you build it." We spent the first call asking the obvious follow-up: did they want similar shapes, similar programs, or similar visual styles. They said yes to all three and were not sure which mattered more. That answer turned out to be the most important data point in the whole project.
The archive itself was uneven. About 12,000 PDFs were vector exports from Vectorworks and ArchiCAD with intact layer metadata. Another 18,000 were flat raster scans of older drawings, some of them photographs of paper plans pinned to a wall. The remaining 10,000 were a mess of competition submissions, sketches in PDF form, and a handful of files that pikepdf had to repair before anything else could open them.
Why CLIP collapsed on the first test
The instinct with any image-RAG project in 2026 is to reach for CLIP or a CLIP-family model, embed everything, store the vectors, and call it done. There is a reason this works for product photos and stock imagery, and a reason it falls apart on floorplans. CLIP is trained on image-text pairs scraped from the open web. The visual vocabulary it learns is photographic. A line drawing of a building footprint occupies almost none of its training distribution.
We tested it anyway. We embedded 500 plans with OpenCLIP ViT-L/14, embedded a few sketches the principal drew on an iPad, and asked for nearest neighbors. The top results for a wedge-shaped sketch were two pizza slices, a slice of cake, and a top-down photo of a swimming pool. The principal laughed, then asked, politely, whether we could try something else.
The pattern shows up in any field where the visual vocabulary drifts away from web photography: medical scans, satellite imagery, CAD output. CLIP is a starting point, not an answer, when your domain is not photographic.
Pulling 40,000 plans out of PDFs without losing the good ones
Before any model touched the data, we had to render the right page of each PDF as a clean raster. Architecture PDFs are pathological. The "plan" can be on page 1, page 7, or page 23. The same PDF often contains site plans, floor plans, sections, elevations, and a title block, and we wanted only floor plans. Vector PDFs render crisp at any DPI; scanned PDFs are noisy and skewed.
The pipeline ended up like this:
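#!/usr/bin/env bash
# render-plans.sh - render every page of every PDF at 300dpi, then classify
shopt -s globstar nullglob   # make ** recurse; skip the loop if nothing matches
for pdf in /archive/**/*.pdf; do
  # Note: PDFs with the same file name in different folders share one output dir.
  base=$(basename "$pdf" .pdf)
  out="/render/$base"
  mkdir -p "$out"
  if ! pdftoppm -r 300 -png "$pdf" "$out/page" 2>/dev/null; then
    # Rendering failed: rewrite the file with pikepdf (a Python library), then retry.
    python3 -c 'import sys, pikepdf; pikepdf.open(sys.argv[1]).save(sys.argv[2])' \
      "$pdf" /tmp/fixed.pdf \
      && pdftoppm -r 300 -png /tmp/fixed.pdf "$out/page"
  fi
done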
Deciding which rendered page is the floor plan took its own small model. We trained a lightweight ResNet-18 on 1,200 hand-labeled examples across six classes: floor plan, site plan, section, elevation, title block, other. Accuracy on the held-out set landed at 94%, which was enough. The misclassified 6% were mostly site plans and floor plans mistaken for each other, and both turned out to be worth searching, so we kept both classes in the index.
Do not skip the page-classifier step. We tried "just embed every page" first. The vector store ballooned from 40,000 entries to roughly 380,000, retrieval quality dropped, and the architects kept getting title-block matches when they sketched a shape.
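A minimal sketch of that filtering step, assuming the classifier emits one label and one confidence per rendered page. The `PagePrediction` type, the `KEEP` set, and the 0.5 threshold are illustrative assumptions, not names or values from the studio's pipeline.

```python
from dataclasses import dataclass

# Classes we index; site plans stay in because they proved searchable too.
KEEP = {"floor plan", "site plan"}

@dataclass(frozen=True)
class PagePrediction:
    pdf_path: str
    page: int          # 1-based page number within the PDF
    label: str         # one of the six classifier classes
    confidence: float  # probability assigned to the predicted class

def pages_to_index(predictions, min_confidence=0.5):
    """Return only the pages worth embedding, in input order."""
    return [
        p for p in predictions
        if p.label in KEEP and p.confidence >= min_confidence
    ]

preds = [
    PagePrediction("a.pdf", 1, "title block", 0.99),
    PagePrediction("a.pdf", 2, "floor plan", 0.97),
    PagePrediction("a.pdf", 3, "section", 0.91),
    PagePrediction("a.pdf", 4, "site plan", 0.88),
    PagePrediction("a.pdf", 5, "floor plan", 0.31),  # too uncertain, dropped
]
kept = pages_to_index(preds)
```

Only pages 2 and 4 survive here, which is the kind of reduction that keeps title blocks out of the vector store.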
The embedding stack we landed on
We tested seven embedding strategies. The two that worked, in order of contribution:
First, DINOv2 base (ViT-B/14) as the backbone. DINOv2 is trained with self-supervised objectives on natural images, but unlike CLIP it does not depend on text captions, and its features turn out to respond well to structure and topology. Line drawings still sit outside its training distribution, but a small adapter head closes the gap.
Second, a 3-layer MLP adapter we trained on top of DINOv2 features using triplet loss. We hand-labeled 600 triples (anchor plan, structurally similar positive, structurally dissimilar negative) over two afternoons with the studio's junior architects. Three hundred more came from synthetic augmentation: rotations, mirrors, and stroke-width perturbations of each anchor.
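For readers who have not used triplet loss, here is a dependency-free sketch of the objective on a single triple. The cosine distance and the 0.2 margin are assumptions chosen for illustration; the text above does not pin down either, and in a real training loop PyTorch's `torch.nn.TripletMarginLoss` would play this role.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine_distance(a, b):
    """1 - cosine similarity; 0 for identical directions, 1 for orthogonal."""
    a, b = normalize(a), normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the positive is closer to the anchor than the negative
    by at least `margin`; otherwise the size of the violation."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

# A well-separated triple costs nothing; a swapped one is penalized.
good = triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
bad = triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

The adapter is trained to drive this quantity to zero across the labeled and augmented triples, which pulls structurally similar plans together without touching the backbone.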
The third piece, which came later and changed the project, is covered under "The week-three problem" below. The backbone and adapter head look like this:
import torch
import torch.nn as nn
from transformers import AutoModel, AutoImageProcessor

# DINOv2 backbone stays frozen; only the adapter head below is trained.
backbone = AutoModel.from_pretrained("facebook/dinov2-base")
proc = AutoImageProcessor.from_pretrained("facebook/dinov2-base")

class PlanHead(nn.Module):
    """3-layer MLP adapter: 768-d DINOv2 CLS feature -> 256-d unit vector."""
    def __init__(self, d_in=768, d_out=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, 512), nn.GELU(),
            nn.Linear(512, 384), nn.GELU(),
            nn.Linear(384, d_out),
        )

    def forward(self, x):
        # L2-normalize the output embedding
        return nn.functional.normalize(self.net(x), dim=-1)

head = PlanHead()  # load the trained adapter weights before embedding anything

@torch.no_grad()
def embed(pil_image):
    # The processor expects 3-channel input; sketches arrive as grayscale.
    px = proc(images=pil_image.convert("RGB"), return_tensors="pt").pixel_values
    feats = backbone(px).last_hidden_state[:, 0]  # CLS token
    return head(feats).cpu().numpy()
Vectors went into Postgres with pgvector. The studio already ran Postgres for their project tracker; adding the extension was a one-line change. We considered Qdrant and Weaviate and decided against bringing in a second database the studio's IT contractor would have to learn.
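A sketch of what the pgvector side might look like, with the SQL held in a Python string so it can be run through whichever Postgres driver you use. The column names come from the query later in this post; the column types, the index name, and the `to_vector_literal` helper are assumptions, and the 256 dimensions cover the shape vector only.

```python
# Hypothetical schema for the plan_embeddings table used in the query below.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS plan_embeddings (
    plan_id   bigint PRIMARY KEY,
    file_path text NOT NULL,
    page      integer NOT NULL,
    embedding vector(256) NOT NULL
);

-- HNSW index on cosine distance, matching the <=> operator
CREATE INDEX IF NOT EXISTS plan_embeddings_hnsw
    ON plan_embeddings USING hnsw (embedding vector_cosine_ops);
"""

def to_vector_literal(values):
    """Format numbers as pgvector's text form, e.g. '[0.5,-0.25,1.0]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"

literal = to_vector_literal([0.5, -0.25, 1])
```

The literal can be passed as the `$1` parameter and cast with `::vector`, which avoids needing a driver-specific vector adapter.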
From sketch canvas to query vector
The interface is a 1200x900 canvas in the browser. The architect draws with a pressure-sensitive stylus or a mouse. Every stroke gets pushed to a queue; when the user pauses for 400ms, we rasterize the canvas to a 518x518 grayscale PNG, dilate the strokes by 2 pixels to match the visual weight of archived plans, run it through the same embedder, and query pgvector for the top 24 nearest neighbors.
SELECT plan_id, file_path, page,
1 - (embedding <=> $1::vector) AS similarity
FROM plan_embeddings
ORDER BY embedding <=> $1::vector
LIMIT 24;
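The stroke dilation in the sketch pipeline is simple enough to show in full. This is a pure-Python sketch on a binary grid where 1 is ink, and it assumes a square neighborhood; the shape of the neighborhood is our guess at a reasonable default, not something the pipeline description specifies. On a real canvas an image library would do this far faster, for example Pillow's `ImageFilter.MinFilter(5)` on a dark-on-white image.

```python
def dilate(grid, radius=2):
    """Grow every ink pixel (1) into a (2*radius+1)-wide square; 0 is background."""
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if grid[y][x]:
                # Stamp the neighborhood, clipped at the canvas edges
                for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                    for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[yy][xx] = 1
    return out

# One ink pixel in the middle of a 7x7 canvas becomes a 5x5 block.
canvas = [[0] * 7 for _ in range(7)]
canvas[3][3] = 1
thick = dilate(canvas, radius=2)
```

The point of the step is to make a thin stylus line carry roughly the same visual weight as the walls in an archived plan, so the embedder sees the query and the corpus in a similar style.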
Median query time on the studio's small Hetzner box, with an HNSW index over 40,000 vectors, is 38ms. They are not going to need a GPU for retrieval. Embedding the sketch on CPU takes another 120ms, which is the dominant cost. We considered moving inference to ONNX Runtime to shave that down and decided that roughly 160ms total felt instant enough on a stylus.
The week-three problem
At the end of week three, our internal benchmark looked great. We had 92% top-5 retrieval on a test set the studio had labeled. We demoed it. The architects hated it.
The shape-similarity model was doing exactly what we trained it to do, which was the wrong thing. The principal sketched a wedge with a courtyard. The system returned six other wedges. None of them had courtyards. The model had learned that the dominant signal was the silhouette, and the courtyard cut, which was the entire point of the sketch, was a small notch the embedding treated as noise.
What the architects actually wanted, when they pushed past the way they had phrased the brief, was program-similar plans. Two bedrooms, one bathroom, a kitchen open to the living room, a small entry, a service core. Shape was a proxy for program, not the goal.
The 12,000 vector PDFs with layer metadata gave us a way in. We parsed those layers (mostly an XML walk through a DXF intermediate), built a room-adjacency graph for each plan, embedded the graph with a tiny graph neural network, and concatenated the result with the shape vector. We weighted the two halves with a slider the architect could move, labeled "shape to rooms." Default 50/50. Most users push it to about 30/70 within their first session.
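One way the slider can work without re-indexing is to apply the weights to the query only. The sketch below assumes each stored row is a concatenation of a unit-length shape half and a unit-length rooms half; the function names are hypothetical, and this is an illustration of the idea, not necessarily how the production system blends the two.

```python
def blend_query(shape_vec, rooms_vec, shape_weight=0.5):
    """Scale each half of the query so its dot product with a stored
    [shape | rooms] vector equals w*shape_sim + (1-w)*rooms_sim."""
    if not 0.0 <= shape_weight <= 1.0:
        raise ValueError("shape_weight must be in [0, 1]")
    w = shape_weight
    return [w * x for x in shape_vec] + [(1.0 - w) * x for x in rooms_vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Stored plan: shape half [1, 0], rooms half [0, 1].
stored = [1.0, 0.0, 0.0, 1.0]

# Query matches the plan's shape exactly but not its rooms.
shape_only = blend_query([1.0, 0.0], [1.0, 0.0], shape_weight=0.3)
# Query matches both halves.
both = blend_query([1.0, 0.0], [0.0, 1.0], shape_weight=0.3)

score_shape_only = dot(shape_only, stored)  # 0.3 * 1 + 0.7 * 0
score_both = dot(both, stored)              # 0.3 * 1 + 0.7 * 1
```

Under that assumption every stored vector has the same length, so ordering by cosine distance gives the same ranking as ordering by this weighted score, and moving the slider only changes the query vector.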
The brief said "search by sketch." The honest brief was "search by program, but I am sketching because that is how I think." Find the gap between those two before you ship.
What it costs to keep running
Indexing the full 40,000 plans took 9 hours on a single L4 GPU we rented for the initial build. Re-indexing on incremental updates runs nightly on the studio's own machine and finishes in about four minutes for the 30 to 60 plans they add per day.
Storage is small. Each plan is a 256-dimension float vector plus metadata: roughly 1.5KB per plan, so under 60MB for the whole archive. The raster previews are larger, around 180GB, sitting on cheap object storage.
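The storage figure is easy to check by hand. The sketch below assumes 4-byte floats, which is what pgvector's `vector` type stores, and a 512-byte metadata allowance per plan; that allowance is our assumption to reach the 1.5KB quoted above, and index overhead is not counted.

```python
DIMS = 256
BYTES_PER_FLOAT32 = 4
METADATA_BYTES = 512   # assumed allowance for path, page number, ids
PLANS = 40_000

vector_bytes = DIMS * BYTES_PER_FLOAT32          # 1024 bytes of raw floats
per_plan_bytes = vector_bytes + METADATA_BYTES   # 1536 bytes = 1.5 KiB
total_mib = per_plan_bytes * PLANS / 2**20

print(f"{per_plan_bytes} bytes per plan, {total_mib:.1f} MiB total")
# prints "1536 bytes per plan, 58.6 MiB total"
```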
Inference at query time runs on the studio's existing Ryzen workstation. They did not buy a GPU. They will not need to for another order of magnitude of growth.
What we would do differently
If we started today, we would skip the OpenCLIP test. We knew it would fail; we ran it anyway because the studio asked. Forty minutes of work, but we could have spent them on the page classifier, which is what the project was actually short on.
We would also fine-tune the DINOv2 backbone, not just the adapter head, on a few thousand plans. The adapter was the fast way to get a result; we expect full fine-tuning would have pushed the top-5 number above 95% on programs, possibly without the graph piece. We did not try it, because the studio wanted to ship in six weeks.
And we would build the program-similarity layer first, not third. Almost every architect we have talked to since this project has confirmed: they think in rooms, not silhouettes.
If you want to try this on your own archive
The smallest thing you can do today, if you have a folder of PDFs and a hunch that an image-RAG layer would help: run pdftoppm -r 200 -png yourfile.pdf out on a sample of fifty files and look at how many pages per PDF you actually need. If the answer is "one or two out of twelve," you have the same page-classifier problem we had, and that is the part to budget for first.
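If you would rather count pages before rendering anything, a short script can build the same picture. This sketch assumes `pdfinfo` is installed (it ships in the same poppler-utils package as `pdftoppm`); the helper names and the sample output string are illustrative.

```python
import re
import subprocess
from collections import Counter

def pages_from_pdfinfo(text):
    """Pull the page count out of `pdfinfo` output; None if it is missing."""
    m = re.search(r"^Pages:\s+(\d+)\s*$", text, flags=re.MULTILINE)
    return int(m.group(1)) if m else None

def count_pages(pdf_path):
    """Run pdfinfo on one file; returns None for files it cannot read."""
    result = subprocess.run(
        ["pdfinfo", pdf_path], capture_output=True, text=True, check=False
    )
    return pages_from_pdfinfo(result.stdout)

def histogram(page_counts):
    """Map page count -> number of PDFs, skipping unreadable files."""
    return Counter(c for c in page_counts if c is not None)

# Hand-written example of the relevant part of pdfinfo's output.
sample_output = "Title:          House 12\nPages:          12\nEncrypted:      no\n"
```

Running `histogram(count_pages(p) for p in sample_paths)` over fifty files gives a quick sense of how many pages the classifier will have to sort through per PDF.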
When we built the floorplan search for that Rotterdam studio, the thing we ran into was the gap between what the brief asked for and what the work actually needed. We solved it by building two embedding spaces and letting the user slide between them. That kind of problem, where the spec is a proxy for the real spec, is most of what we deal with when we build AI agents and RAG systems for studios sitting on a decade of accumulated work.
Key takeaway
The brief said search by sketch. The honest brief was search by program. Find that gap between the spec and the work before you ship.
FAQ
Why did CLIP fail on floorplans?
CLIP is trained on image-text pairs from the open web, so its visual vocabulary is photographic. Line drawings of building footprints sit almost entirely outside that training distribution, and the nearest neighbors come back as pizza slices and pool photos.
Why use DINOv2 instead of CLIP for line drawings?
DINOv2 is trained self-supervised on images without depending on captions, so its features respond more to structure and topology than to semantic labels. A small adapter head on top closes the remaining gap to line art.
Do you need a GPU to run image RAG at query time?
Not for 40,000 vectors. We run pgvector with an HNSW index on a small Hetzner box; median query is 38ms and sketch embedding takes another 120ms on CPU. A GPU is only needed for the initial bulk index.
What is the hardest part of building image RAG for PDFs?
Picking the right page. Architecture PDFs mix plans, sections, elevations, and title blocks. Without a page-classifier the vector store balloons and retrieval quality drops sharply.
Can you search by shape and by program at the same time?
Yes. We embed shape with DINOv2 plus an adapter, embed room-adjacency graphs with a small GNN, concatenate the two, and let the user weight them with a slider. Most architects settle around 30% shape, 70% program.