
Shadow AI in a law firm: three bots, one NDA, three calls

A 47-person Antwerp law firm built eleven private contract-review bots over eighteen months. The day three of them disagreed live, the partner shut all the laptops.

Jacob Molkenboer · Founder · A Brand New Company · 10 Jun 2026 · 9 min


A senior partner at a 47-person Antwerp law firm is standing at the head of an oak conference table on a Tuesday morning. Across from him sits the general counsel of a Belgian shipping group, and a fat stack of NDAs that need to clear by Friday. The partner has three associates in the room. Each one of them, over the past eighteen months, has quietly built their own contract-review tool on top of a chat LLM and a spreadsheet. This is shadow AI walking into a paying client's deal, and the partner does not yet know it. He asks each associate to flag concerns on clause 12.4 of the lead NDA. He gets three different answers, all confidently delivered, on the room's main screen, in front of the client.

This is a real case: the firm called us the following week. We are going to walk through what we found, what we changed, and what we tell every firm that has crept into the same shape without noticing.

How a cottage industry actually forms

Nobody at the firm sat down and decided to build an internal AI program. There was no steering committee, no budget line, no procurement memo. Shadow AI never gets approved. It accretes, the way cottage industries always do, one frustrated specialist at a time.

It started with Anouk, a junior in M&A. She was buried in NDA redlines and discovered that a chat assistant could turn a 14-clause NDA into a list of deviations from the firm's standard template in about ninety seconds. She built a Google Sheet with two columns: "paste NDA here" and "paste flags here". She used it on roughly forty deals before mentioning it at Friday drinks.

Three months later, Jeroen in tax was running his own version. His was different in three ways: he used a longer system prompt, he had loaded the firm's internal redlining guide as context, and he had wired the spreadsheet to a Python script on his laptop that he ran from a desktop shortcut shaped like a tiny gavel. Pieter, in employment, had a third version: a Slack bot in a private channel that only he and two paralegals could access.

By the time the Antwerp meeting happened, there were eleven of these things in the firm. Eight ran on Claude. Two ran on a competitor. One had drifted onto a free tier of something the IT department had never approved. None of them shared a system prompt. None of them shared a redlining standard. None of them logged their outputs.

That is the shape: a firm full of careful, intelligent operators, each individually optimising, all unconsciously building a parallel infrastructure with no contracts between the parts.

The three contradictory flags

The clause that broke the meeting was a fairly normal residual-knowledge carveout. It said, in summary, that information retained in unaided memory by personnel of the recipient would not constitute a breach. Three associates ran their three bots. Anouk's bot flagged it as "non-standard and a likely concession to the counterparty". Jeroen's bot flagged it as "industry-typical for shipping and inland-vessel work, recommend accept". Pieter's bot flagged it as "ambiguous, escalate to partner before signature".

None of those answers is wrong in isolation. The first is the firm's M&A house view from 2024. The second is the tax practice's view, which had inherited a redlining standard from a former lead partner who used to run shipping deals. The third is the employment practice's view, which defaults to escalation any time a clause touches retained employee knowledge.

The bots were not hallucinating. They were doing exactly what they had been built to do. The problem is that each one had been built by someone who reasonably assumed their practice's house view was the firm's house view, and the firm had never written down which one was correct.
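A minimal sketch of why the three tools disagreed, assuming each one effectively encoded a single practice's rule for this clause type. The `HOUSE_VIEWS` table, the practice keys, and the `review` function are hypothetical illustrations, not the associates' actual prompts; the three flag strings are the ones quoted above.

```python
# Hypothetical: each private bot carries its own practice's rule for the
# same clause type. The model can be identical; only the house view differs.
HOUSE_VIEWS = {
    "ma": "non-standard and a likely concession to the counterparty",
    "tax": "industry-typical for shipping and inland-vessel work, recommend accept",
    "employment": "ambiguous, escalate to partner before signature",
}


def review(clause_type: str, practice: str) -> str:
    """Return the flag a bot built on `practice`'s house view would give."""
    if clause_type != "residual_knowledge":
        raise ValueError(f"no rule sketched for clause type: {clause_type}")
    return HOUSE_VIEWS[practice]


# Same clause, three practices, three answers.
flags = {practice: review("residual_knowledge", practice) for practice in HOUSE_VIEWS}
distinct_answers = len(set(flags.values()))
print(distinct_answers)
```

Nothing in this sketch is broken. Each lookup returns exactly what it was given, which is the point: the disagreement sits in the table, not in the function.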

Takeaway

When three internal AI tools give three different answers, the model is almost never the problem. The problem is that three humans inside the building had three different opinions, and you only noticed when the bots said them out loud.

What broke in the client meeting

The partner did the only thing he could in the moment. He closed all three laptops, apologised for "a tooling issue", flagged the clause manually from his own memory of the standard, and moved on. The general counsel was polite about it. She also asked, on her way out, whether the firm "had a view on how it was using AI". That question, asked in that tone, on the way out of a meeting, is the sound of a referral being put on probation.

Within a week the firm had a CIO who had never bought managed AI infrastructure before, three associates who felt personally responsible for embarrassing the partner, and a managing partner who wanted a one-page memo explaining how this had happened.

The honest answer fits in three lines. Eleven people built private tools. Nobody owned the shared layer. The first time the shared layer was load-bearing, in a client meeting, it failed.

Why banning shadow AI does not work

The wrong reflex, and the one we see firms reach for first, is to issue a policy memo banning unsanctioned AI use and roll out a single approved tool from one vendor. We have watched two firms try this. In both cases the cottage industry simply went further underground. Associates kept their private spreadsheets and stopped mentioning them at drinks.

The reason is straightforward. Anouk's tool worked. It saved her three hours per NDA. Telling her to stop using it, without replacing it with something that also saves her three hours per NDA, is not a policy. It is a productivity tax.

The legal exposure around shadow AI is also no longer theoretical. Courts and regulators across Europe are starting to treat tools that speak confidently in your name as if they are speaking as you, which for a law firm advising clients is a category of risk that has to be priced in. The failure modes overlap neatly with patterns catalogued in the OWASP Top 10 for LLM Applications, particularly excessive agency and overreliance (the latter a standalone category in the 2023 list and folded into misinformation in the 2025 edition). The fix is not to silence the tools. The fix is to make sure they are speaking the firm's view, not eleven private variations on it.

What a shared internal layer actually looks like

When a firm asks us how to clean up shadow AI, we draw the same picture every time. It has four parts, and the order matters.

One canonical knowledge layer

Before you touch a single bot, you write down the firm's house view. Not all of it, just the parts the bots are getting wrong. For the Antwerp firm, that was a redlining standard, eighty-three clauses long, in plain Dutch, signed off by the four practice heads. We helped them draft it. It took six weeks. It is the single most important artefact in the entire migration.

This is the document the bots read. Not the system prompt. Not the model. The document. Models will get swapped out twice a year forever. The document is the thing that has to be true.
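One way to picture "the document" as something a tool can read is a versioned set of clause rules with one entry per clause type. This is a sketch under assumptions: the class names, field names, and the example rule are hypothetical, and the position text is a placeholder rather than the firm's actual view.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ClauseRule:
    clause_type: str  # e.g. "residual_knowledge"
    position: str     # the firm's agreed view, in plain language
    action: str       # "accept", "reject" or "escalate"


@dataclass(frozen=True)
class RedliningStandard:
    version: str
    signed_off_by: frozenset[str]  # practice heads who approved this version
    rules: tuple[ClauseRule, ...]

    def rule_for(self, clause_type: str) -> ClauseRule | None:
        """Look up the single firm-wide rule for a clause type, if any."""
        for rule in self.rules:
            if rule.clause_type == clause_type:
                return rule
        return None


# Placeholder content only: the real positions come from the practice heads.
standard = RedliningStandard(
    version="v0.1",
    signed_off_by=frozenset({"head-1", "head-2", "head-3", "head-4"}),
    rules=(
        ClauseRule(
            clause_type="residual_knowledge",
            position="placeholder: the agreed firm-wide position goes here",
            action="escalate",
        ),
    ),
)
print(standard.rule_for("residual_knowledge"))
```

The design choice worth noting is that a clause type maps to exactly one rule per version. A swap of model or prompt leaves this structure untouched.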

One inference path

Eleven private API keys become one shared service. Every associate's tool calls the same endpoint, with the same retrieval layer pulling from the same redlining standard, with the same logging. The associates can still build their own interfaces on top. They cannot build their own brains underneath.

# What every associate's tool calls now
curl https://internal.firm.local/review \
  -H "Authorization: Bearer $FIRM_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "document": "",
    "practice": "ma",
    "client_id": "shipping-group-be",
    "reviewer": "anouk"
  }'

The endpoint returns the flags, the clause references, the version of the redlining standard it used, and a log id. If two associates disagree, the log id resolves the disagreement in five minutes, not three weeks.
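A sketch of what the two sides of that call could look like in Python. The request fields mirror the curl example; the response key names (`flags`, `clause_refs`, `standard_version`, `log_id`) are assumptions chosen to match the four things the endpoint is described as returning, and the example body is made up.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewResponse:
    flags: list[str]
    clause_refs: list[str]
    standard_version: str
    log_id: str


def build_request(document: str, practice: str, client_id: str, reviewer: str) -> str:
    """Serialise the same four fields the curl call sends."""
    return json.dumps({
        "document": document,
        "practice": practice,
        "client_id": client_id,
        "reviewer": reviewer,
    })


def parse_response(body: str) -> ReviewResponse:
    """Parse a response body; the key names here are assumed, not documented."""
    data = json.loads(body)
    return ReviewResponse(
        flags=list(data["flags"]),
        clause_refs=list(data["clause_refs"]),
        standard_version=data["standard_version"],
        log_id=data["log_id"],
    )


# A made-up response body, for illustration only.
example_body = json.dumps({
    "flags": ["escalate"],
    "clause_refs": ["12.4"],
    "standard_version": "v0.1",
    "log_id": "log-0001",
})
response = parse_response(example_body)
print(response.standard_version, response.log_id)
```

Because every interface parses the same shape, two associates comparing notes can start from the `log_id` and the `standard_version` rather than from their screenshots.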

One audit trail

Every call is logged with the input, the output, the model version, the prompt version, the retrieval snapshot, and the human who triggered it. This is non-negotiable for a regulated profession, and it tracks closely with what frameworks like the NIST AI Risk Management Framework expect when an AI is sitting inside decisions that matter. It is also non-negotiable for the next awkward call: you want to be able to ask "which version of the standard did the bot read on the day of the meeting" and get an answer in seconds.
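As an illustration, the logged fields can be modelled as one record per call, and the "which version did the bot read that day" question becomes a simple filter. The record layout, the function name, and the sample entries (including their dates) are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class AuditRecord:
    log_id: str
    timestamp: datetime
    reviewer: str            # the human who triggered the call
    input_text: str
    output_text: str
    model_version: str
    prompt_version: str
    retrieval_snapshot: str  # identifies the version of the standard that was read


def standards_read_on(records: list[AuditRecord], day: date) -> set[str]:
    """Return every retrieval snapshot that was read on the given day."""
    return {r.retrieval_snapshot for r in records if r.timestamp.date() == day}


# Made-up records for illustration.
records = [
    AuditRecord("log-0001", datetime(2026, 1, 13, 9, 30), "anouk",
                "clause text", "flag text", "model-a", "prompt-3", "standard-v1.0"),
    AuditRecord("log-0002", datetime(2026, 1, 14, 11, 0), "jeroen",
                "clause text", "flag text", "model-a", "prompt-3", "standard-v1.1"),
]
print(standards_read_on(records, date(2026, 1, 13)))
```

A real deployment would keep these records in an append-only store rather than a Python list; the list is only there to keep the sketch self-contained.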

One escape valve

Associates are allowed, expected even, to disagree with the bot. The interface has an "I'm overriding this" button that captures their reasoning and feeds it back into the standard's review queue. The eighty-three-clause document is not frozen. It is a living artefact that grows whenever a real lawyer pushes back on what the bot said.
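A minimal sketch of the override path, assuming the button produces a small record that lands in a first-in, first-out review queue. The `Override` and `ReviewQueue` names and their fields are hypothetical, and the rule that an override must carry reasoning is an assumption of this sketch.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Override:
    log_id: str      # the logged call being overridden
    reviewer: str
    clause_ref: str
    reasoning: str


class ReviewQueue:
    """Holds overrides until the standard's owners have looked at them."""

    def __init__(self) -> None:
        self._pending: deque[Override] = deque()

    def submit(self, override: Override) -> None:
        # An override with no reasoning gives the owners nothing to review.
        if not override.reasoning.strip():
            raise ValueError("an override needs the lawyer's reasoning")
        self._pending.append(override)

    def next_for_review(self) -> Override | None:
        """Pop the oldest pending override, or None if the queue is empty."""
        return self._pending.popleft() if self._pending else None

    def __len__(self) -> int:
        return len(self._pending)


queue = ReviewQueue()
queue.submit(Override("log-0001", "pieter", "12.4", "Touches retained employee knowledge."))
print(len(queue))
```

Tying each override to a `log_id` means the people reviewing the standard can see exactly what the tool said before the lawyer disagreed with it.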

Warning

The most common failure mode we see is firms building the inference path before writing the canonical knowledge layer. You end up with one bot that gives one consistent answer, which is also consistently wrong. Write the document first.

What this cost the firm, roughly

The honest accounting: six weeks to draft the standard (about 180 hours of senior time, split four ways across the practice heads). Four weeks of engineering to stand up the inference path and the audit trail. Two days to migrate the eleven existing tools, whose owners were genuinely thrilled to throw away their private API keys and stop paying for them on personal cards.

The thing that pays for all of it is not the saved engineering hours. It is the next time a clause comes up in a client meeting and three associates open three interfaces and get the same answer. The general counsel did not ask the firm to stop using AI. She asked whether the firm had a view on how it was using AI. The shared layer is what lets the firm say yes.

The pattern is not specific to law

Shadow AI looks the same in every industry where we have run into it: a Rotterdam logistics company with six homemade route-planning bots and four answers, a Lisbon agency where everyone had a personal prompt library and three of them contradicted each other on tone of voice, a German B2B SaaS where sales engineers had each built their own quote-drafting assistant and at least one was leaking proposal text into a free tier. The names change. The shape does not.

If you are a founder or an operations lead and you are not sure whether you have a cottage industry, the test is two questions. First, ask any three people in your company to perform the same task with their AI tools and compare answers. Second, ask your finance team how many AI subscriptions are sitting on personal expense reports. If question one gives you three different answers and question two gives you a number above four, you are already a law firm in May 2026. You just have not had the meeting yet.
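The two-question test can be written down as a small check. This is a sketch: the function name is hypothetical, and comparing answers by normalised text is a crude stand-in for a person actually reading the three answers side by side.

```python
def has_cottage_industry(answers: list[str], personal_ai_subscriptions: int) -> bool:
    """Apply the two-question test: three differing answers and more than four subscriptions."""
    # Crude comparison: ignore case and surrounding whitespace only.
    normalised = {a.strip().lower() for a in answers}
    three_different_answers = len(answers) == 3 and len(normalised) == 3
    return three_different_answers and personal_ai_subscriptions > 4


print(has_cottage_industry(["accept", "reject", "escalate"], 5))
```

Both conditions have to hold, which matches the wording of the test: three different answers alone, or a pile of subscriptions alone, does not trip it.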

What you can do today

When we built the shared review layer for the Antwerp firm, the thing that surprised us most was how little of the work was engineering. Most of it was sitting in a room with four practice heads and getting them to write down what they actually believed. That is the part of any AI-agent build that decides whether the output holds up in a client meeting.

Open a fresh document. Title it "house view, v0.1". Write down the five questions your team gets asked most often, and the answer your most senior person would give. Then go find the three private tools you already know about, and ask whoever built them whether the tool would have given that same answer last Tuesday. Whatever gap shows up is your starting point.

Key takeaway

When three internal AI tools disagree, the model isn't the problem. Your firm has three undocumented opinions and only just noticed.

FAQ

What is "shadow AI" and why does it matter for regulated industries?

Shadow AI is unsanctioned AI tooling built bottom-up by individual employees without governance, audit, or a shared knowledge base. In regulated work, it means client-facing answers can drift from your firm's actual position.

Should we just ban personal AI tools at our company?

Bans without replacements push usage underground. The pattern that works is writing down your firm's actual view on the recurring questions first, then replacing private tools with a shared layer that reads from it and saves people at least as much time as what they built privately.

What is the minimum version of a shared internal AI layer?

One canonical document of your house view on the top five recurring questions, one shared API endpoint that reads from it, and one audit log. Models and interfaces can vary. The document and the log cannot.

How long does cleaning up internal AI sprawl take?

For a firm of about fifty people: six weeks to draft the canonical knowledge document with practice heads, four weeks to build the shared inference path and audit trail, two days to migrate the existing private tools onto it.

tooling · ai agents · operations · workflow · strategy · architecture
