RAG

RAG over 41,000 Jupyter notebooks: a Groningen case study

A junior naval architect needed the 2017 ballast-tank calculation. It lived in one of 41,000 notebooks the seniors refused to migrate. Here is what we did.

Jacob Molkenboer· Founder · A Brand New Company· 18 Jul 2025· 11 min

Open oak index-card drawer with cream cards, one tagged chartreuse, brass divider, folded dark linen on ivory desk.

A junior naval architect at a Groningen consultancy opens her laptop at 09:14 on a Tuesday. The deliverable is a stability assessment for a 132-metre offshore service vessel. Somewhere in the firm's archive sits a 2017 ballast-tank calculation by a now-retired senior engineer who handled exactly this hull form. She knows it exists. She has been told three times to ask the notebook on the share drive. The share drive holds 41,000 .ipynb files going back to 2009.

That is the situation we walked into in March. The brief, after a morning of whiteboarding with the partners: a RAG agent over the full notebook archive that the seniors would not have to feed, retrain, or migrate into.

The notebook the seniors would not give up

The client is a 28-person consultancy near Zernike Campus that does structural and hydrodynamic work for offshore wind, dredging, and inland fleet operators. Their senior engineers work the way they have worked since 2011: Jupyter notebooks, pandas, NumPy, a hand-rolled finite-element wrapper around Code_Aster. Every project is a tree of notebooks. Every calculation is a cell. Every cell has an output: a pandas dataframe with the wall thickness per frame, a matplotlib plot of the bending moment, a single scalar that the classification society wants to see.

When we were called in, the firm was about to lose two senior engineers to retirement inside eighteen months. The juniors knew the theory. They did not know where to find the institutional answer. The proposal on the table was: migrate everything into a structured engineering database, retag, refactor, rewrite. Six-figure budget. Two-year horizon. The seniors had killed the proposal three times already. They liked their notebooks.

We agreed with the seniors.

Why a standard chunker fails on a Jupyter notebook

A .ipynb file is a JSON document. The naïve RAG ingestion path is: read JSON, concatenate all source and markdown cells, chunk by 1,000 tokens, embed, done. Plug that into any of the off-the-shelf RAG starters and you get something that demos well and is useless in production.

The reason is in the outputs. A ballast-tank calculation notebook from 2017 looks like this:

import pandas as pd
import numpy as np

# Tank geometry from drawing 17-042 rev C
tanks = pd.read_csv("tanks_17042c.csv")
tanks["volume_m3"] = tanks.apply(
    lambda r: r["length"] * r["beam"] * r["depth"] * r["k_fill"],
    axis=1,
)

ballast = tanks[tanks["service"] == "ballast"]
ballast.groupby("location")["volume_m3"].sum().round(1)

The reasoning is in the code. The answer is in the output below it, which the notebook stored as a stream like:

location
aft        1284.6
forepeak    412.1
midship    2106.3
Name: volume_m3, dtype: float64

If the chunker only reads source cells, the retrieval layer has the question but not the answer. If it reads raw cell.outputs, it gets bytes of base64 PNG plot data, ANSI-coloured tracebacks, and repr() strings that no embedding model handles cleanly. We tried both for a week. Neither worked.

Pandas outputs as first-class chunks

The thing we changed was treating each output as its own retrievable unit. A notebook is parsed into a sequence of typed chunks.

Markdown cell: kept as prose, embedded normally.
Code cell: source only, embedded with a small prefix noting the variables it defines.
DataFrame output: parsed back into a pandas table, then serialised to Markdown with column types and a one-sentence machine summary (Mean volume 1267.7 m³ across 3 ballast locations).
Plot output: caption extracted from the surrounding markdown plus a vision-model description of the rendered PNG.
Scalar output: indexed with the variable name on the left-hand side and the source cell as context.

Each chunk carries metadata: notebook path, cell index, project code, vessel class, original author, year, the SHA of the upstream CSV inputs. The vessel class and year matter a lot. A junior asking what ballast distribution did we use for a 130m PSV wants 2015 to 2019 work, not the 2009 barge studies.

The metadata schema took two iterations. The first version had author, year, project code, and vessel class. The second version added the SHA of every upstream CSV, the pinned library versions the notebook expected, and a free-text tag that the senior engineer could attach to a notebook on commit. Most never bothered. The two who did saved their junior colleagues hours of guessing. The hidden cost in any RAG system is metadata you discover you need three months after launch. Plan for it on day one and you spend an afternoon. Add it later and you reindex the corpus.

Takeaway

In a notebook corpus, the dataframe output is the answer. Indexing only the source code is like indexing the question and throwing away the response.

Pyodide as a deterministic re-render layer

A notebook is a snapshot. When you open a 2017 file in 2026, the outputs are still the ones the senior engineer saw on his Lenovo in November 2017. They might be wrong. The CSV may have moved. The library version is gone. The chart may be unreadable because matplotlib changed defaults.

We rebuild every output in a sandbox before indexing.

The sandbox is Pyodide, the CPython distribution that runs in WebAssembly. Each notebook is executed against a pinned environment, the outputs are captured, the embedded outputs in the file on disk are left untouched, and only the freshly-rendered ones go into the index. If a notebook fails to re-execute (broken path, missing input file, deprecated API), it gets flagged and stays in a separate stale bucket that retrieval can fall through to with a lower priority and a warning to the reader.

This was a build decision we revisited twice. The first version used a Docker pool with CPython. It worked, but spinning up containers for 41,000 notebooks was painful, and every CSV path inside the notebooks was a security headache. Senior engineers had baked in absolute paths to network shares like \\srv-calc-01\projects\2017\17-042\inputs\tanks_17042c.csv. Pyodide gave us the deterministic environment without the IO surface. We wrote a small import-hook that rewrites those UNC paths into a virtual filesystem mounted from object storage. The senior engineers' notebooks run unmodified. The pipeline never touches the live share, and the agent never gets a chance to leak a path that maps to a live drive.

The pace of the Pyodide project helps. The 314.0 release added the ability for Python packages to publish WebAssembly wheels directly to PyPI, which means our pinned environment file is now a normal requirements.txt instead of a custom asset bundle. We rebuilt the pipeline the week the release dropped.

Warning

Do not re-render production data into a public sandbox. Pyodide runs in the browser by default, which means the corpus has to be reachable by browser code. We run Pyodide inside a Node worker on our own infrastructure with no external network egress.

Retrieval that knows about cells

When a junior asks a question, the agent does not just embed the query and pull the top eight chunks. It does three things.

Embed the query against the chunk corpus and pull the top fifty candidates.
Group candidates by notebook. A single notebook with four matching chunks (one markdown, one source cell, one dataframe, one scalar) almost always beats four unrelated notebooks with one matching chunk each.
For the top three notebooks, fetch a structured neighbourhood: the markdown header above the matching cell, the cell itself, the next two cells, and any output the next two cells produce.

The fan-out to fifty before grouping matters. Early experiments embedded the query and pulled the top eight chunks directly. The eight came from eight different notebooks roughly half the time, and the agent stitched together a Frankenstein answer that no real engineer would write. Grouping by notebook before re-ranking was the simplest change with the biggest accuracy jump in our internal eval.

The response the engineer sees is a short answer plus a link that opens the notebook at the matching cell in a read-only viewer. We did not want to teach anyone to use JupyterLab again. The viewer shows the original 2017 outputs side-by-side with the re-rendered ones, with a diff badge when they differ. The seniors love the diff badge. It catches drift they would never have noticed on their own.

What the junior architects actually ask

Eight weeks in, the agent handled 1,260 queries in week eight. We sampled 200 of them by hand.

About 35% are where is the calculation for X. A junior knows the project exists and wants the notebook. About 28% are what value did we use for X in similar work. The junior is doing their own calculation and wants a sanity check. About 18% are vocabulary questions: a senior used a term (equivalent design pressure, GZ-arm at 30 degrees) that did not make it into the textbook. The rest is a long tail.

The 200-question audit found three wrong answers. All three were the same failure mode: the agent retrieved a notebook from 2014 that used a hull form coefficient table the firm has since abandoned. We added a deprecated_coefficient flag to the notebook metadata and excluded those from default retrieval. The fix was an afternoon.

The thing we did not predict: the seniors started using it too. Not for fresh questions. For what did I write in 2019. One of them admitted he uses it as an external memory for his own work.

The numbers, eight weeks in

41,107 notebooks indexed. 38,924 re-rendered cleanly in Pyodide, 2,183 in the stale bucket.
Median query latency: 1.4 seconds to first token, 4.8 seconds to full answer with the notebook neighbourhood fetched.
Query volume week eight: 1,260, up from 84 in week one. The growth is juniors telling each other in Slack.
Average time-to-answer for where is the X calculation: from 22 minutes (Slack-asking a senior, waiting) to under 2 minutes.

Cost ran about €0.011 per query on average, dominated by the answer model and the per-cell vision descriptions we cache at ingest time. The full corpus rebuild costs €38 if we re-render everything from scratch, which we do nightly because notebooks committed during the day can fail in surprising ways and we want to know inside twenty-four hours, not at the end of a sprint.

The migration project that had been on the table for three years is now off the table. The seniors still write notebooks. The juniors find them. The institutional memory survived two retirements that have not yet happened.

What this generalises to

If you are sitting on a corpus of notebooks, CAD drawings, or Word documents the senior engineers will not give up, the lever is not migration. The lever is treating the outputs as first-class. When we built this RAG agent for the Groningen consultancy, the move that mattered was Pyodide-based re-rendering plus indexing the pandas dataframes as their own chunks. Pick the ten most-asked engineering questions from your last quarter and check whether any of the answers live in a dataframe nobody is searching today. That is your five-minute audit.

Key takeaway

When your knowledge base is Jupyter notebooks, the answers live in the pandas outputs. Index those as first-class chunks, not the source cells.

FAQ

Why not just migrate the notebooks into a structured database?

Because the seniors will not write into one. Every migration project at this firm died because the people producing the knowledge refused to change tools. RAG over the existing corpus survives that refusal.

Why re-render outputs with Pyodide instead of trusting what is in the .ipynb file?

Saved outputs are 2017 snapshots. Inputs move, libraries change, and stale outputs poison retrieval. Re-running in a pinned WebAssembly environment gives deterministic, current answers.

How do you handle plots and figures in retrieval?

Each plot output gets a vision-model description plus the surrounding markdown as caption, indexed as its own chunk. The junior gets the figure rendered inline and can open the notebook at that cell.

What stack does the agent run on in production?

A Node worker pool runs Pyodide for re-rendering, chunks land in a Postgres + pgvector index, retrieval is hybrid (BM25 plus dense), and the chat layer sits behind the firm's existing SSO.

ragai agentscase studyknowledge basearchitectureoperations

Building something?

Start a project