RAG on 14,000 MATLAB scripts: a Delft hydraulics rebuild

A junior engineer needed to re-run a 2019 pump-curve fit. The script lived on a Citrix VM nobody had touched in eleven months. We built a way out without a VPN.

Jacob Molkenboer · Founder · A Brand New Company · 15 Jun 2026 · 9 min

The junior engineer's message came in at 22:47 on a Wednesday. Her team had been asked to verify a pump-curve fit produced in 2019 for an Eastern Scheldt maintenance brief. The original author was on parental leave. The script lived on a Citrix VM the firm had not logged into for eleven months, on a Windows Server box that still ran MATLAB R2018b under a per-seat license that had lapsed.

She could read the script. She could not run it.

That was the situation the bureau called us about. Twenty-one engineers in Delft, three decades of hydraulics work, and a research archive of roughly 14,000 MATLAB and Octave scripts that nobody could touch without a VPN tunnel and an admin reset. Six months after we shipped, the agent over that archive answers 940 questions a week, and the answers include re-run output, not just citations.

A chat box was not the deliverable

The first instinct, when a firm has fourteen thousand scripts and one search bar, is to point a retrieval agent at the codebase and call it done. We did the cheap version of this in week one as a sanity check. It returned the right script for the prompt "pump curve, centrifugal, 2019" four out of five times. The engineers' response was: yes, we already know how to grep. What we cannot do is run the thing.

The actual problem was not finding the script. The problem was that the answer to most questions in this bureau is not a paragraph. It is a number, a plot, or a fitted curve. A retrieval agent that returns the matching .m file and a confidence score is, to a working engineer, a slightly faster grep.

So we changed the brief. The agent had to find the relevant script, read it well enough to identify inputs, run it against the right inputs, and return the output the engineer would have produced by hand. That last point reframed the whole stack.

The shape of the corpus

The archive was three decades of work split across three dialects.

Roughly 9,300 files were modern MATLAB (R2014b and later, heavy use of the Signal Processing and Curve Fitting toolboxes, plus Simulink). About 3,100 were older MATLAB (pre-2010, single-letter variables, no comments, the kind written by someone who knew the equations cold). About 1,600 were Octave scripts written by PhD students between 2008 and 2016 who did not have MATLAB licenses at home.

There was overlap. A given pump-curve calculation might exist as a 2009 Octave prototype, a 2014 MATLAB rewrite for client delivery, and a 2019 hot-fix branched from the 2014 version with the wrong author name in the header. None of the three was canonical, and the engineers disagreed about which to trust.

We did not try to deduplicate the corpus. We indexed all of it and let provenance fall out of the retrieval. The agent surfaces the most-cited version first, then the most recently edited, and shows a small diff hint when two versions exist for the same problem.
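
One way to read that ordering rule is a sort on citation count with edit date as the tie-break, plus a short line diff between the top two candidates. The sketch below assumes that reading; the ScriptVersion fields, the file names and the way citations are counted are all hypothetical, not the bureau's schema.

```python
from dataclasses import dataclass
from datetime import date
import difflib


@dataclass
class ScriptVersion:
    path: str          # hypothetical archive path
    citations: int     # assumed: how often other scripts or reports reference it
    last_edited: date
    source: str


def rank_versions(versions):
    """Most-cited first; ties go to the most recently edited version."""
    return sorted(versions, key=lambda v: (v.citations, v.last_edited), reverse=True)


def diff_hint(a, b, max_lines=5):
    """Return the first few added or removed lines between two versions."""
    diff = list(difflib.unified_diff(
        a.source.splitlines(), b.source.splitlines(),
        fromfile=a.path, tofile=b.path, lineterm="", n=0,
    ))
    # The first two entries are the ---/+++ file headers; skip them,
    # then keep only the changed lines (hunk markers start with @@).
    changed = [line for line in diff[2:] if line.startswith(("+", "-"))]
    return changed[:max_lines]
```

The hint is deliberately small: it is there to tell the engineer that two versions disagree and roughly where, not to replace opening both files.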

Pyodide in the browser, not on the server

The instinct here is to put a Python execution sandbox on a backend container, route requests through it, and call it modern. We had three reasons not to.

First, license posture. Running MATLAB itself on a server we control means buying floating licenses for an automated user. MathWorks does not love this and the bureau's procurement team did not want the conversation. Octave is fine, but Octave alone cannot cover the toolbox-heavy 2019 scripts.

Second, blast radius. An engineer-triggered execution that hits the production cluster is one bad regex away from a load problem. We wanted the execution sandbox to die when the user closed the tab.

Third, observability. We wanted the engineer to see exactly what was running and modify it before re-running. A black-box server step does not give you that.

So the execution layer is Pyodide, the CPython build that runs in the browser via WebAssembly. The retrieval pipeline finds the relevant script. The translation pass converts it to Python with NumPy, SciPy and Matplotlib equivalents. Pyodide runs the translated code in the engineer's browser tab.

That means: no VPN, no Citrix, no MATLAB license touching a server we run, no production execution surface. The engineer gets the output in the same chat panel that returned the citation. If she wants to tweak an input, she edits a code block and re-runs in place.

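<script type="module">
  // Pyodide's ES module build, pinned to one version so the package set stays reproducible.
  import { loadPyodide } from "https://cdn.jsdelivr.net/pyodide/v0.26.2/full/pyodide.mjs";

  const py = await loadPyodide();
  // Load the scientific packages the translated scripts import.
  await py.loadPackage(["numpy", "scipy", "matplotlib", "pandas"]);

  // fetchTranslated and renderPlot are application helpers defined elsewhere;
  // scriptId and inputs come from the current chat request.
  const src = await fetchTranslated(scriptId, inputs);
  const out = await py.runPythonAsync(src);
  renderPlot(out);
</script>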

The Octave bridge

The translation layer is the messy part, and the part most teams underestimate.

Modern MATLAB and Octave overlap heavily for the kind of work this bureau does: linear algebra, signal processing, curve fitting. But the overlap is not clean. A few examples we hit in the first month.

MATLAB's fit() from the Curve Fitting Toolbox has no one-line Octave equivalent. We map it to scipy.optimize.curve_fit with a wrapper that mimics MATLAB's fittype syntax.
xlsread and xlswrite behave differently across MATLAB versions and are not part of core Octave (they come from the separate io package). We replaced xlsread with pandas.read_excel and xlswrite with DataFrame.to_excel.
Anonymous function handles (@(x) x.^2) translate directly to Python lambdas, but only if the element-wise operators also get rewritten (.^ to **, .* to * on NumPy arrays).
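
The last example can be sketched as a small string rewrite. This is an illustration of the idea, not the bureau's translation pass: translate_handle is a hypothetical helper, it only handles single-expression handles, and it assumes every remaining * in the body is scalar or element-wise multiplication.

```python
import re

# MATLAB element-wise operators and their NumPy-compatible Python spellings.
ELEMENTWISE = {".^": "**", ".*": "*", "./": "/"}

# Matches a simple handle such as "@(x) x.^2": an argument list, then a body.
HANDLE = re.compile(r"@\(([^)]*)\)\s*(.+)")


def translate_handle(matlab_expr):
    """Turn a simple MATLAB anonymous function into Python lambda source."""
    match = HANDLE.fullmatch(matlab_expr.strip())
    if match is None:
        raise ValueError(f"not a simple anonymous function: {matlab_expr!r}")
    args, body = match.group(1), match.group(2)
    # A bare ^ is matrix power in MATLAB but bitwise XOR in Python, so
    # refuse it rather than emit code that runs and gives a wrong answer.
    if re.search(r"(?<!\.)\^", body):
        raise ValueError("bare ^ needs a type-aware rewrite")
    for matlab_op, python_op in ELEMENTWISE.items():
        body = body.replace(matlab_op, python_op)
    return f"lambda {args}: {body}"


print(translate_handle("@(x) x.^2"))  # lambda x: x**2
```

Refusing the bare ^ is the design choice worth copying. A translator that fails loudly on constructs it does not understand is easier to validate than one that guesses.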

We did not write a universal translator. We wrote a translation pass that handles the constructs that actually show up in this bureau's scripts, validated with a regression set of 320 paired MATLAB/Python outputs from the engineers themselves. The translator passes when the numerical output matches a tolerance the engineers agreed on per script type: 1e-6 for linear algebra, 1e-3 for fitted curves, exact match for indexing operations.
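
A pass/fail check along those lines might look like the sketch below. The three tolerances are the ones quoted above; the dictionary keys, the function name and the decision to apply each tolerance as both a relative and an absolute bound are assumptions, since the text does not say which kind of tolerance the engineers agreed on.

```python
import math

# Tolerances from the text; the keys are illustrative labels.
TOLERANCES = {"linear_algebra": 1e-6, "fitted_curve": 1e-3, "indexing": 0.0}


def outputs_match(script_type, matlab_out, python_out):
    """Compare two flat sequences of numbers under the tolerance for a script type."""
    tol = TOLERANCES[script_type]
    if len(matlab_out) != len(python_out):
        return False
    if tol == 0.0:
        # Indexing results must agree exactly.
        return list(matlab_out) == list(python_out)
    # Assumption: the same figure serves as relative and absolute tolerance.
    return all(
        math.isclose(m, p, rel_tol=tol, abs_tol=tol)
        for m, p in zip(matlab_out, python_out)
    )
```

Each of the 320 pairs would then reduce to one call, for example outputs_match("fitted_curve", matlab_coeffs, python_coeffs), with the two coefficient lists loaded from the stored outputs.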

Retrieval over scientific code

The retrieval choices were not exciting and that was the point.

We chunked at the function level, not the file level. A 2,400-line MATLAB script that defines twelve helper functions becomes twelve chunks plus a top-level chunk for the orchestration. We tried file-level chunking first and it did badly. The embedding model could not distinguish "this file contains pump-curve fitting code" from "this file contains pump-curve fitting code among many other things."
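
Function-level chunking can be approximated with a line-anchored regular expression. The sketch below is a simplification under stated assumptions: it splits on lines that begin with the function keyword, treats everything before the first one as the orchestration chunk, and ignores nested functions and function-like text inside comments or strings. chunk_matlab and the "<top-level>" label are hypothetical names.

```python
import re

# A definition line: optional output list ("h =" or "[a, b] ="), then the name.
FUNC_DEF = re.compile(
    r"^[ \t]*function[ \t]+(?:\[?[\w \t,~]*\]?[ \t]*=[ \t]*)?([A-Za-z]\w*)",
    re.MULTILINE,
)


def chunk_matlab(source):
    """Split MATLAB source into (name, text) chunks: one per function,
    plus a leading chunk for any top-level orchestration code."""
    starts = [(m.start(), m.group(1)) for m in FUNC_DEF.finditer(source)]
    chunks = []
    first = starts[0][0] if starts else len(source)
    top_level = source[:first].strip()
    if top_level:
        chunks.append(("<top-level>", top_level))
    for i, (pos, name) in enumerate(starts):
        end = starts[i + 1][0] if i + 1 < len(starts) else len(source)
        chunks.append((name, source[pos:end].strip()))
    return chunks
```

Each chunk keeps its function name, which can be stored next to the embedding so a hit can be traced back to a specific function rather than to a 2,400-line file.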

We added a structured metadata layer next to the embeddings: author, last modified, toolbox dependencies, variable names that appear in headers. This lets the agent filter ("only scripts that use the Signal Processing Toolbox") before the semantic match runs.

We use pgvector for the index, sitting in the same Postgres instance as the metadata. This was deliberate. We had used dedicated vector stores on earlier projects and the operational cost of running another stateful service for a team of twenty-one was not justified. There is one Postgres to back up, one to monitor.
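
With metadata and embeddings in one database, the filter and the similarity ranking fit in a single statement. The sketch below only builds that statement. The table script_chunks and its columns (script_path, function_name, author, toolboxes, embedding) are hypothetical, the %(name)s placeholders follow the psycopg convention, and <=> is pgvector's cosine-distance operator; the caller is assumed to supply query_vec when executing.

```python
def build_search_query(toolbox=None, author=None, limit=5):
    """Return (sql, params) for a metadata-filtered nearest-neighbour search.

    The caller adds params["query_vec"] (the embedded question) before executing.
    """
    conditions, params = [], {}
    if toolbox is not None:
        # Assumed schema: toolboxes is a text[] column of toolbox dependencies.
        conditions.append("%(toolbox)s = ANY(toolboxes)")
        params["toolbox"] = toolbox
    if author is not None:
        conditions.append("author = %(author)s")
        params["author"] = author
    where = ("WHERE " + " AND ".join(conditions)) if conditions else ""
    sql = (
        "SELECT script_path, function_name "
        "FROM script_chunks "
        f"{where} "
        "ORDER BY embedding <=> %(query_vec)s::vector "
        "LIMIT %(limit)s"
    )
    params["limit"] = limit
    # Collapse the double space left behind when there is no WHERE clause.
    return " ".join(sql.split()), params
```

Because the WHERE clause narrows the candidate rows before the ORDER BY ranks them, a filter such as "only scripts that use the Signal Processing Toolbox" costs nothing extra in application code.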

The Postgres choice also forced an honest conversation about deletes. When a script is removed or replaced in the archive, we rewrite the rows rather than chain individual deletes, because on a busy table with a large index, a hot loop of deletes is operationally painful. The recent debate around scalable deletes in Postgres (the punchline being that the only truly scalable delete is DROP TABLE) is not our problem at this scale, but it shaped how we modelled replacement. Bulk swap is the default. Full deletes are a rare maintenance event.

940 questions a week

After six months, the agent receives between 880 and 1,050 questions per week, with a strong Tuesday and Wednesday bias. Roughly:

58% are retrieval-only ("which script computed the 2014 storm-surge envelope for the IJmuiden tidal model")
31% are retrieval plus single-pass execution ("re-run that script with these new headwater levels")
9% are multi-step ("re-run it, plot the residuals against the 2014 plot, and tell me whether the discrepancy is greater than two standard deviations")
2% are declined because the script depends on a toolbox call we have not translated yet

The 2% bucket is the most interesting one for us. It is the queue that tells us what to translate next.
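
The decline path can be sketched as a scan for known toolbox calls that are missing from the translated set, with a counter standing in for the queue. Both sets below are hypothetical examples, and a real pass would need a proper parser, because in MATLAB an indexing expression such as q(1) looks the same as a call.

```python
import re
from collections import Counter

# Hypothetical: toolbox functions the scanner knows about ...
TOOLBOX_CALLS = {"fit", "butter", "filtfilt", "pwelch"}
# ... and the subset the translation pass already covers.
TRANSLATED = {"fit"}

CALL = re.compile(r"\b([A-Za-z]\w*)\s*\(")

# Running tally of what blocked a request: the "translate next" queue.
translation_queue = Counter()


def untranslated_calls(source):
    """Return the toolbox calls in a script that cannot be translated yet,
    and record each one in the queue."""
    calls = set(CALL.findall(source))
    missing = sorted((calls & TOOLBOX_CALLS) - TRANSLATED)
    translation_queue.update(missing)
    return missing
```

An empty result means the script can go on to translation; a non-empty one is the decline. translation_queue.most_common() then gives the backlog ordered by how often each call blocked a request.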

The principal engineer told us, after the third month, that the hierarchy of who-asks-whom in the office had shifted. Juniors were now asking the agent the questions seniors used to answer in the corridor. Seniors were asking the agent the questions they would otherwise have handed to an intern. That was not a metric we were tracking, but it tracks.

What we would do differently

Three things.

We would have built the translation regression set first. We started with retrieval, then bolted execution on, then realised we needed a paired-output validation set to know whether the translation was actually correct. Building that set retroactively cost us three weeks. If you are doing this, build the regression set in week one with the engineers in the room.

We would have started Pyodide work earlier. The first prototype ran translated Python on a small server. The latency was fine. The procurement and audit conversation around the server was not. Moving to in-browser execution killed two months of compliance review that we should not have entered into. Pyodide already ships most of SciPy, and load time on a modern laptop is acceptable for engineering work.

We would have indexed the git history, not just the files. Half the questions the engineers ask have an implicit time dimension: "the version we sent the client in March 2019" or "before we changed the head-loss coefficient." We added git-aware retrieval in month four. It should have been there in week one.
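
The time dimension reduces to a lookup once each script's commit dates are indexed: find the last commit on or before the date in the question. A minimal sketch, assuming the history is available as a date-sorted list of (commit_date, sha) pairs; the function name and the example hashes are made up.

```python
from bisect import bisect_right
from datetime import date


def version_as_of(commits, when):
    """Return the sha of the last commit made on or before `when`.

    `commits` must be sorted by date, oldest first. Returns None if the
    script did not exist yet on that date.
    """
    dates = [commit_date for commit_date, _ in commits]
    index = bisect_right(dates, when)
    return commits[index - 1][1] if index else None


history = [
    (date(2014, 3, 10), "a1f3"),
    (date(2019, 2, 4), "b27c"),
    (date(2019, 6, 18), "c9e0"),
]
print(version_as_of(history, date(2019, 3, 31)))  # b27c
```

A question like "the version we sent the client in March 2019" then becomes two steps: resolve the phrase to a date, and resolve the date to a commit.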

Warning

Do not promise an in-browser execution layer before you have measured cold-start time for your specific package list. Pyodide plus NumPy plus SciPy plus Matplotlib is a few seconds on a modern laptop. Add a heavier toolbox and the conversation changes.

The smallest thing you can do this week

If you have an archive like this sitting on a legacy box, the smallest useful thing you can do today is count the files, count the dialects, and write down the last-modified date of the oldest one anyone actually opened in the past six months. That date tells you whether you have a knowledge base or a graveyard.
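
A first pass at that count fits in a short script. The sketch below rests on several assumptions: it counts by file extension only, which cannot tell MATLAB from Octave because both use .m, so the dialect split still needs a content check; it treats the filesystem access time as "opened", which is unreliable on volumes mounted with access-time updates disabled; and it takes six months as 182 days. The census function and its result keys are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta
from pathlib import Path


def census(root, now=None, window_days=182):
    """Count files by extension and find the oldest last-modified date
    among files accessed within the window."""
    now = now or datetime.now()
    cutoff = (now - timedelta(days=window_days)).timestamp()
    by_extension = Counter()
    oldest_mtime = None
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        by_extension[path.suffix.lower()] += 1
        stat = path.stat()
        # Only files someone opened recently count towards the date.
        if stat.st_atime >= cutoff:
            if oldest_mtime is None or stat.st_mtime < oldest_mtime:
                oldest_mtime = stat.st_mtime
    return {
        "total": sum(by_extension.values()),
        "by_extension": dict(by_extension),
        "oldest_recently_opened": (
            datetime.fromtimestamp(oldest_mtime) if oldest_mtime is not None else None
        ),
    }
```

Pointing census at the archive root yields the file total, the per-extension counts and the date in one dictionary.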

When we built this RAG agent for the Delft team, the thing we did not expect was how much of the work was not retrieval but execution: getting the answer the engineer would have produced by hand, in the same panel, in a browser tab. We ended up solving it by moving the runtime into Pyodide and writing a translation pass narrow enough to actually validate against engineer-paired outputs.

Key takeaway

RAG over scientific code is not a retrieval problem. It is an execution problem. Return the number the engineer would have produced, in the same panel.

FAQ

Why run Python in the browser instead of on a server?

License posture, blast radius and observability. No MATLAB license touches a server we run, the sandbox dies with the tab, and the engineer can read and edit the code before re-running it.

How do you keep MATLAB to Python translation correct?

A paired-output regression set built with the engineers. The translator passes when numerical outputs match an agreed tolerance per script type: 1e-6 for linear algebra, 1e-3 for fitted curves, exact for indexing.

Why pgvector instead of a dedicated vector database?

One stateful service to run for twenty-one engineers. The metadata that drives pre-filtering already lives in Postgres, so co-locating the embeddings removed an operational dependency rather than adding one.

What happens when a script depends on a toolbox call you have not translated?

The agent declines and logs the call into a translation queue. That queue is roughly 2% of weekly traffic and it tells us which toolbox surface to translate next.

How long did this take to ship?

Roughly fourteen weeks from kickoff to the bureau-wide rollout, with another four weeks of follow-up to add git-aware retrieval and tighten the translation pass.
