Contract-review agents: what Stanford's law result missed

Stanford reports a language model outscoring law professors on legal analysis. Watch an SME firm work past 9pm and you'll find the real bottleneck sits somewhere else entirely.

Jacob Molkenboer · Founder · A Brand New Company · 14 Mar 2024 · 6 min

It is 21:40 on a Tuesday in a four-partner firm above a bookshop on the Oudegracht. The youngest associate has eighteen NDAs in front of her, all variants of the same template from one Bay Area counterparty, and she is reviewing each contract against the firm's house redlines by eye. The senior partner left at six. The Stanford RegLab result that hit the front page of Hacker News this week says a frontier model now scores higher than law professors on legal analysis. The associate has not read it. She is on NDA seven.

Contract-review agents are quietly running on back-office servers at a small but growing number of mid-size law firms across the Netherlands and Belgium. When the Stanford finding landed on Wednesday morning, the conversation inside those firms was not "do we still need associates?" It was: does this thing now do the boring eight tenths of our weekly work, or are we still where we were last quarter?

What the Stanford study actually measured

The headline that travelled was "AI outperforms law professors." The thing being measured is a set of legal analysis prompts scored by other experts. It is a test of reasoning on cleanly framed questions, not a test of operating inside a firm. That distinction matters because most of the value a small law firm could capture from agents lives outside the benchmark.

A separate paper from the same Stanford group last year tested commercial legal AI tools on practitioner-style queries and found hallucinations on roughly one in six of them. The same lab. Two different framings. Both are worth taking seriously; neither is a permission slip.

The work an SME law firm does on a Tuesday

Walk through the office at 09:30 and write down every task that touches a contract. At the firms we have observed up close the list looks like this:

  • Read an inbound NDA, compare it to the house template, mark the deltas.
  • Check that the indemnity cap matches what the partner agreed by email last Friday.
  • Pull every employment agreement signed in 2024 that contains a non-compete and flag the ones longer than twelve months.
  • Translate clause 14.3 of a Dutch SaaS contract for a German subsidiary.
  • Confirm that the governing-law clause in this draft is the Dutch one the partner asked for, not the English one the counterparty pushed.

None of those tasks needs professor-grade legal reasoning. They need fast, careful, auditable comparison work. They need a system that knows the house template and can explain which line changed and why.

Where contract-review agents earn their keep

The agents that actually survive contact with a working firm do not try to be a junior partner. They do three things well.

First, redline triage. The agent ingests an inbound contract, locates the corresponding house clauses, and produces a single document with a colour-coded diff and a short note next to each change explaining what shifted and which way the risk moved. A senior associate then reads thirty pages of notes instead of two hundred pages of contract.
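The comparison half of redline triage can be sketched in a few lines. This is a minimal illustration, not how any particular product does it: it assumes both contracts have already been split into clauses keyed by heading, and it uses Python's standard `difflib` for a word-level diff. The function name `diff_clauses` and the sample clauses are hypothetical. The risk note beside each change would come from a reviewer or a model downstream and is not shown.

```python
import difflib


def diff_clauses(house: dict[str, str], inbound: dict[str, str]) -> list[dict]:
    """Compare inbound clauses with house clauses that share a heading.

    Both arguments map a clause heading to its text (an assumed input shape).
    Returns one finding per clause that was added, removed, or changed.
    """
    findings = []
    for heading in sorted(set(house) | set(inbound)):
        ours = house.get(heading)
        theirs = inbound.get(heading)
        if ours is None:
            findings.append({"clause": heading, "status": "added", "changes": [theirs]})
        elif theirs is None:
            findings.append({"clause": heading, "status": "removed", "changes": [ours]})
        elif ours != theirs:
            # Word-level diff; keep only deletions ("- ") and insertions ("+ ").
            delta = [
                token
                for token in difflib.ndiff(ours.split(), theirs.split())
                if token.startswith(("- ", "+ "))
            ]
            findings.append({"clause": heading, "status": "changed", "changes": delta})
    return findings


# Hypothetical example clauses, for illustration only.
house_template = {
    "Term": "This agreement lasts two years",
    "Governing law": "Dutch law applies",
}
inbound_nda = {
    "Term": "This agreement lasts five years",
    "Governing law": "Dutch law applies",
    "Non-solicit": "No hiring of staff",
}

for finding in diff_clauses(house_template, inbound_nda):
    print(finding["clause"], finding["status"], finding["changes"])
```

Unchanged clauses produce no finding at all, which is the point: the associate only sees what moved.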

Second, portfolio search. The firm has fifteen years of agreements on a SharePoint that nobody can search. A retrieval pipeline that respects clause boundaries (rather than naive fixed-token chunking) lets an associate ask "every services agreement with a liability cap under €500k since 2022" and get a list in seven seconds.
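To make "respects clause boundaries" concrete, here is a rough sketch of splitting a contract on its numbered headings instead of every N tokens. It assumes headings sit on their own line and start with a clause number such as "2." or "14.3"; real contracts are messier, and a body line that happens to start with a number would be misread as a heading. The pattern and the function name `split_into_clauses` are illustrative assumptions.

```python
import re

# Assumed heading shape: a line that starts with a clause number ("2." or "14.3")
# followed by a title on the same line.
CLAUSE_HEADING = re.compile(
    r"^(?P<number>\d+(?:\.\d+)*)\.?[ \t]+(?P<title>\S.*)$",
    re.MULTILINE,
)


def split_into_clauses(text: str) -> list[dict]:
    """Return one chunk per numbered clause, so no clause is cut in half.

    Text before the first heading (a preamble, for example) is ignored in
    this sketch.
    """
    matches = list(CLAUSE_HEADING.finditer(text))
    chunks = []
    for index, match in enumerate(matches):
        # A clause runs from the end of its heading to the start of the next one.
        end = matches[index + 1].start() if index + 1 < len(matches) else len(text)
        chunks.append(
            {
                "number": match.group("number"),
                "title": match.group("title").strip(),
                "text": text[match.end():end].strip(),
            }
        )
    return chunks


# Hypothetical contract text, for illustration only.
sample = (
    "1. Definitions\n"
    "Terms have the meanings below.\n"
    "\n"
    "2. Liability\n"
    "The cap is EUR 400,000.\n"
    "It applies per year.\n"
)

for chunk in split_into_clauses(sample):
    print(chunk["number"], chunk["title"], "->", chunk["text"])
```

Each chunk then carries its clause number and title into the index, so a search hit can point at "clause 2, Liability" instead of an arbitrary window of text.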

Third, intake routing. New email from a client with a PDF attached. The agent classifies it, extracts the parties and the dates, opens a matter in the case-management system, and drafts the acknowledgement reply. Nobody types a subject line.

Takeaway

The Stanford benchmark measures reasoning. The hours a small firm wants back are comparison, search and routing. Build for the second list and the first one stops mattering.

Where the agent never goes

Every firm with a deployed contract-review agent that we have looked at has drawn a hard line around three categories of work: drafting novel clauses, advising on litigation strategy, and anything that goes to a court or a regulator without a partner's name on it. The agent suggests; a human signs. There is no auto-approve toggle anywhere in a sensible product.

The Stanford finding does not change that line. A higher benchmark score does not lower the professional liability question, and the Dutch Verordening op de advocatuur still requires a lawyer to be accountable for the advice. The agent is a fast paralegal that never sleeps. It is not the lawyer.

The small-firm bottleneck

Magic Circle firms can afford a research team that evaluates every new model release. A six-partner firm in Eindhoven cannot. What that firm can do, with a focused agent on a clean workflow, is buy back the eight hours a week that the youngest two associates spend on NDA comparison and clause hunting. That is not a ten-times-faster story. It is roughly the difference between staying late on Tuesdays and going home at six.

The Stanford result is interesting because it points to a ceiling that keeps rising. The work inside these firms is interesting because the floor (the boring, repeatable, comparison-heavy hours) is already addressable today, with the models that shipped last year, if you build the workflow around them carefully.

One small audit for Monday morning

When we built the contract-review pipeline for a multi-office Dutch firm last year, the house templates lived in three Word versions across two SharePoints and nobody knew which was canonical. We solved it with a one-off consolidation script and a weekly diff report before any AI agents went near a single document.

Open a fresh document. For one week, ask every fee-earner in the firm to log the contract-touching tasks they did that they would describe as comparison, not judgment. Tally the hours at the end of the week. If the number is above ten, an agent will pay for itself before the quarter ends. If it is below three, do not buy anything.
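If the log ends up in a spreadsheet export, the tally is a few lines of code. The sketch below assumes each entry records who did the task, how long it took, and whether they would call it comparison or judgment; the `LogEntry` fields and the sample entries are hypothetical. The two thresholds come from the rule of thumb above. The text says nothing about totals between three and ten hours, so the "inconclusive" label for that range is an assumption.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    fee_earner: str
    task: str
    hours: float
    kind: str  # "comparison" or "judgment", as the fee-earner labelled it


def weekly_verdict(entries: list[LogEntry]) -> tuple[float, str]:
    """Sum the comparison hours for the week and apply the rule of thumb."""
    total = sum(entry.hours for entry in entries if entry.kind == "comparison")
    if total > 10:
        verdict = "an agent should pay for itself within the quarter"
    elif total < 3:
        verdict = "do not buy anything"
    else:
        # Assumption: the rule of thumb does not cover the 3-10 hour range.
        verdict = "inconclusive, keep logging"
    return total, verdict


# Hypothetical week of log entries, for illustration only.
week = [
    LogEntry("associate A", "NDA comparison", 4.0, "comparison"),
    LogEntry("associate B", "clause hunting", 3.5, "comparison"),
    LogEntry("associate A", "indemnity cap check", 3.0, "comparison"),
    LogEntry("partner", "litigation strategy call", 2.0, "judgment"),
]

total_hours, verdict = weekly_verdict(week)
print(f"{total_hours} comparison hours: {verdict}")
```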

Key takeaway

The Stanford benchmark measures legal reasoning under lab conditions. The work that pays off at SME firms is fast, auditable comparison.

FAQ

Does the Stanford result mean AI can replace lawyers?

No. The study measured isolated reasoning tasks, not the regulated practice of law. Lawyers stay accountable for advice. Agents handle the comparison and search work around it.

What does a contract-review agent actually do day to day?

Redline triage on inbound contracts against your house templates, fast retrieval across the historical archive, and intake routing for new client mail. Drafting novel clauses stays with humans.

How long does setup take at a typical SME firm?

Two to six weeks. Most of that is consolidating the house templates and indexing the historical archive. The agent itself is the last stretch of the project, not the first.

What about hallucinations on legal questions?

Real risk. We constrain the agent to retrieval over the firm's own documents and force every suggestion to cite a source clause. The human signing the work still verifies before it leaves the building.
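One way to enforce the cite-a-source rule mechanically is to reject any suggestion whose citation cannot be checked against the retrieved text. The sketch below is an illustration under assumptions, not a description of a specific product: suggestions are plain dictionaries with a `clause_id` and a verbatim `quote`, and `source_clauses` maps clause identifiers to text taken from the firm's own documents. All names and sample data are hypothetical.

```python
def validate_suggestions(
    suggestions: list[dict], source_clauses: dict[str, str]
) -> tuple[list[dict], list[dict]]:
    """Split suggestions into those with a verifiable citation and the rest.

    A suggestion is accepted only if it cites a clause that was actually
    retrieved and its quote appears verbatim in that clause.
    """
    accepted, rejected = [], []
    for suggestion in suggestions:
        source = source_clauses.get(suggestion.get("clause_id"))
        quote = suggestion.get("quote", "")
        if source is not None and quote and quote in source:
            accepted.append(suggestion)
        else:
            rejected.append(suggestion)
    return accepted, rejected


# Hypothetical retrieved clause and model suggestions, for illustration only.
retrieved = {"nda-7/4.2": "Liability is capped at EUR 500,000 per year."}
proposed = [
    {"clause_id": "nda-7/4.2", "quote": "capped at EUR 500,000", "note": "Cap matches house position."},
    {"clause_id": "nda-7/9.1", "quote": "governed by English law", "note": "Cites a clause that was never retrieved."},
    {"clause_id": "nda-7/4.2", "quote": "capped at EUR 50,000", "note": "Quote does not appear in the source."},
]

ok, flagged = validate_suggestions(proposed, retrieved)
print(len(ok), "accepted,", len(flagged), "rejected")
```

A check like this only catches citations that do not exist; it says nothing about whether the note draws the right conclusion from the clause, which is why the human review stays.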
