
Chat agents

Chat agent rubric: Copilot Studio vs LibreChat vs SDK

A Friday-evening hallucination in a Monday brief is not a prompt problem. It is a platform-choice problem. Here is the three-axis rubric we use to pick the stack.

Jacob Molkenboer · Founder · A Brand New Company · 15 Jun 2026 · 8 min

Friday evening in an Antwerp kantoor

A paralegal at a twelve-person firm asks the new chat agent to summarise an arbitration award before a Monday brief. The bot answers in eight seconds and cites a 2019 Hof van Cassatie ruling that does not exist. The paralegal, tired, pastes the paragraph into the memo. The partner catches it on Sunday. Now the firm has a question that is not about prompts or tokens or whether the answer was "mostly right". The question is which of three platforms it should have run on in the first place, and who signs the file when the next hallucination slips past Sunday review.

We get asked to score that decision every few weeks for legal, accountancy, and notary practices. Below is the rubric we use. Three candidates, three axes, one matrix at the end. The firm we are scoring is paralegal-facing, twelve seats, roughly 380 dossier queries per week.

The three candidates

Microsoft 365 Copilot Studio is the path of least friction for any firm already living inside M365. You build the agent in a no-code canvas, point it at SharePoint and Outlook, and let Azure absorb the data-residency story. The tradeoff: you inherit Microsoft's roadmap, Microsoft's eval surface, and Microsoft's billing curve.

LibreChat is the self-hosted route. Open source, multi-provider, database-backed (MongoDB for conversations in the stock deployment, Postgres with pgvector for retrieval). You stand it up on a VPS in Frankfurt or Brussels, point it at the model API of your choice, and run conversations against a database you own. You operate the box. You also operate the audit trail.

A custom build on the Claude Agent SDK is the bespoke option. You write the orchestration in TypeScript or Python, you pick the retrieval pattern (BM25 plus embeddings, hybrid rerankers, whatever fits), and you ship your own evals. More work up front. More leverage at runtime, including when a regulator asks how the model decided something. The Anthropic agent docs are the right starting point if you go this way.

Axis 1: per-seat cost at 380 weekly queries

Twelve seats and 380 weekly queries is about 1,650 queries per month, or 140 per seat. Below are the numbers we landed on for one real firm earlier this year, all in euros, rounded to whole units. Pricing on all three platforms moves quarterly. Re-run the math before you commit.
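For anyone re-running the math, the volume conversion is a short calculation. A minimal sketch, assuming an average month of 52 / 12 weeks; the function and field names are ours, for illustration only.

```typescript
// Converts a weekly query count into monthly and per-seat figures.
// ASSUMPTION: an average month is 52 / 12 weeks (about 4.33).
const WEEKS_PER_MONTH = 52 / 12;

interface Volume {
  perMonth: number;
  perSeatPerMonth: number;
}

function monthlyVolume(weeklyQueries: number, seats: number): Volume {
  const perMonth = weeklyQueries * WEEKS_PER_MONTH;
  return { perMonth, perSeatPerMonth: perMonth / seats };
}

const volume = monthlyVolume(380, 12);
console.log(Math.round(volume.perMonth));        // 1647, rounded to 1,650 in the text
console.log(Math.round(volume.perSeatPerMonth)); // 137, rounded to 140 in the text
```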

Copilot Studio. Microsoft 365 Business Standard or Premium is the prerequisite (call it €13 to €22 per seat per month). On top, Copilot Studio runs on a metered "message" model: roughly €0.01 per simple message and €0.10 per generative answer, prepaid in packs. At 1,650 generative answers per month you are looking at about €165 in messages, plus a tenant-level platform fee for the standalone Copilot Studio license. Total all-in for twelve seats: about €430 to €520 per month, or €36 to €43 per seat. Pricing details on the Copilot Studio page shift, so treat the number as directional.
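The itemised part of that estimate can be reproduced as follows. This is a sketch built only on the figures quoted above; the tenant-level platform fee is not itemised in the text, so the code leaves it out rather than guessing it. All names are illustrative.

```typescript
// Reproduces the itemised part of the Copilot Studio estimate.
// Every euro figure is taken from the text above. The tenant-level
// platform fee is not itemised there, so it is left out of the subtotal.
const SEATS = 12;
const GENERATIVE_ANSWERS_PER_MONTH = 1650;
const EUR_PER_GENERATIVE_ANSWER = 0.1;

interface KnownCosts {
  licences: number;
  messages: number;
  subtotal: number;
}

function copilotKnownCosts(seatLicenceEur: number): KnownCosts {
  const licences = seatLicenceEur * SEATS;
  const messages = Math.round(GENERATIVE_ANSWERS_PER_MONTH * EUR_PER_GENERATIVE_ANSWER);
  return { licences, messages, subtotal: licences + messages };
}

const copilotLow = copilotKnownCosts(13);  // licences 156, messages 165, subtotal 321
const copilotHigh = copilotKnownCosts(22); // licences 264, messages 165, subtotal 429

// Per-seat figures derived from the quoted all-in totals of 430 and 520 euros.
const perSeatLow = Math.round(430 / SEATS);  // 36
const perSeatHigh = Math.round(520 / SEATS); // 43
console.log(copilotLow, copilotHigh, perSeatLow, perSeatHigh);
```

The gap between each subtotal and the quoted all-in total is the tenant-level platform fee.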

LibreChat + Postgres. One €60 to €120 per month VPS in eu-central, a managed Postgres for €25, and the model API bill. At 1,650 dossier queries with an average of 3,000 input tokens (the dossier excerpt) and 800 output tokens (the answer), Claude Sonnet runs around €25 to €40 per month in tokens. Total: €110 to €185 per month for the whole firm, or €9 to €15 per seat. The catch: somebody is operating it. Budget two to four hours of senior ops per month, which on a €120 internal rate is another €240 to €480.
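The token line is the one most worth re-running, because it moves with both volume and price. A rough sketch: the query count and token sizes come from the text, but the per-million-token prices are placeholder assumptions of ours, not quoted figures. Swap in the provider's current price list.

```typescript
// ASSUMPTION: per-million-token prices are placeholders, not figures from the
// text. Replace them with the provider's current price list before relying on this.
const ASSUMED_EUR_PER_MILLION_INPUT = 3;
const ASSUMED_EUR_PER_MILLION_OUTPUT = 15;

function monthlyTokenCost(queries: number, inputTokens: number, outputTokens: number): number {
  const inputMillions = (queries * inputTokens) / 1e6;
  const outputMillions = (queries * outputTokens) / 1e6;
  return inputMillions * ASSUMED_EUR_PER_MILLION_INPUT + outputMillions * ASSUMED_EUR_PER_MILLION_OUTPUT;
}

// 1,650 queries, 3,000 input and 800 output tokens each (figures from the text).
const tokenCostEur = monthlyTokenCost(1650, 3000, 800);

// Infrastructure range from the text: VPS + managed Postgres + tokens.
const infraLow = 60 + 25 + 25;   // 110
const infraHigh = 120 + 25 + 40; // 185

// Ops time from the text: two to four hours at a 120 euro internal rate.
const opsLow = 2 * 120;  // 240
const opsHigh = 4 * 120; // 480

console.log(tokenCostEur.toFixed(2));                 // 34.65 under the assumed prices
console.log(infraLow + opsLow, infraHigh + opsHigh);  // 350 665
```

Under those assumed prices the token bill lands inside the €25 to €40 range quoted above. Adding the ops hours to the infrastructure range gives €350 to €665 per month, which is the figure to compare if the ops time is real money for your firm.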

Custom Claude Agent SDK build. Token costs are the same as LibreChat. Hosting is the same scale. The delta is build cost. A defensible first version for legal Q&A (retrieval from a curated dossier corpus, citation enforcement, a confidence-gated refusal path) is 60 to 120 engineering hours. Amortised over twelve months at our rates, that is roughly €40 to €80 per seat per month for year one, falling to €15 to €25 from year two onwards as only changes and evals are billed.

On pure spend, LibreChat wins year one. The custom build wins year three. Copilot Studio wins nothing on cost, but it never claimed to.

Axis 2: beroepsgeheim defensibility under the tuchtreglement

This is where the cheap answer becomes expensive. Belgian lawyers are bound by Article 458 of the Strafwetboek and by the Codex Deontologie of the Orde van Vlaamse Balies. The OVB-tuchtreglement is what actually disciplines a misstep. Three questions matter.

Where does the dossier text physically sit while the model reads it? Copilot Studio runs in Azure regions you select, and Microsoft offers EU Data Boundary commitments for tenant data. LibreChat sits wherever you provision it; pick a Brussels or Frankfurt VPS and the answer is "in the EU, in a server room you can name". The Claude Agent SDK build calls the Anthropic API, which means the request leaves your perimeter. Anthropic offers EU inference endpoints and a Zero Data Retention option under enterprise contracts, but a twelve-person firm typically does not qualify without going through a partner. That detail matters.

Can you prove, in a tuchtprocedure, exactly what the model saw and what it returned? Copilot Studio gives you the M365 audit log, which is useful but not granular about the model's intermediate reasoning. LibreChat writes every turn to a database you own, which you can dump and present; note that the stock deployment keeps conversations in MongoDB, so the export is a mongodump rather than a pg_dump. The Agent SDK build logs whatever you write, which is to say: everything, if you wrote it that way. We default to logging the retrieved chunks, the prompt, the tool calls, and the final answer with hashes, behind a seven-year retention policy.
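One way to make that logging default tamper-evident is a hash chain, where each record's hash covers its own content plus the previous record's hash. The sketch below is an illustration under our own assumptions, not a description of any platform's built-in log: the record shape, field names, and the in-memory array are hypothetical, and a real build would persist the records to a database.

```typescript
import { createHash } from "node:crypto";

// One audit entry per answered query. Field names are illustrative.
interface AuditInput {
  query: string;
  retrievedChunks: string[];
  prompt: string;
  toolCalls: string[];
  answer: string;
}

interface AuditRecord extends AuditInput {
  timestamp: string;
  retainUntil: string;
  prevHash: string;
  hash: string;
}

const RETENTION_YEARS = 7;
const GENESIS = "genesis";

function sha256(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Serialises the hashed fields in a fixed order, so the hash does not
// depend on object key order.
function canonical(r: Omit<AuditRecord, "hash">): string {
  return JSON.stringify([
    r.query, r.retrievedChunks, r.prompt, r.toolCalls, r.answer,
    r.timestamp, r.retainUntil, r.prevHash,
  ]);
}

// Appends a record whose hash covers its content and the previous hash,
// so editing an earlier record invalidates the chain.
function appendRecord(log: AuditRecord[], input: AuditInput, now: Date): AuditRecord {
  const prevHash = log.length > 0 ? log[log.length - 1].hash : GENESIS;
  const retain = new Date(now.getTime());
  retain.setUTCFullYear(retain.getUTCFullYear() + RETENTION_YEARS);
  const body = {
    ...input,
    timestamp: now.toISOString(),
    retainUntil: retain.toISOString(),
    prevHash,
  };
  const record: AuditRecord = { ...body, hash: sha256(canonical(body)) };
  log.push(record);
  return record;
}

// Recomputes every hash from stored content and checks the chain links.
function verifyChain(log: AuditRecord[]): boolean {
  let prev = GENESIS;
  for (const record of log) {
    if (record.prevHash !== prev || sha256(canonical(record)) !== record.hash) return false;
    prev = record.hash;
  }
  return true;
}

// Usage: two records, then a simulated after-the-fact edit.
const auditLog: AuditRecord[] = [];
const first = appendRecord(auditLog, {
  query: "Summarise the award", retrievedChunks: ["chunk-1"],
  prompt: "SYSTEM + chunk-1", toolCalls: ["lookup"], answer: "Draft summary",
}, new Date("2026-06-15T09:00:00Z"));
appendRecord(auditLog, {
  query: "List the deadlines", retrievedChunks: ["chunk-2"],
  prompt: "SYSTEM + chunk-2", toolCalls: [], answer: "Draft list",
}, new Date("2026-06-15T09:05:00Z"));

const intactBefore = verifyChain(auditLog); // true
auditLog[0].answer = "Edited later";
const intactAfter = verifyChain(auditLog);  // false
console.log(intactBefore, intactAfter, first.retainUntil);
```

The design choice is that verification needs nothing but the stored records, so a third party can re-run it on an export.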

Who is the verwerker, and what does the verwerkersovereenkomst actually say? Microsoft and Anthropic both publish data processing agreements. Read them. The clause that matters is the one about using customer data for model training. All three vendors say "no" on enterprise tiers in 2026. Two of them say "yes" by default on consumer tiers. A firm that signs up for the wrong tier is a firm that ends up in front of the tuchtraad explaining why dossier text was a training input.

Warning

If the contract you signed is the click-through one, you signed the wrong contract. Beroepsgeheim defensibility starts with a negotiated DPA, not a marketing page.

Axis 3: sign-off when a hallucination reaches the brief

This axis is about people, not tech. When a fabricated jurisprudence citation lands in a memo and the partner catches it on Sunday, there is a chain of accountability the OVB will reconstruct. We score each platform on how clear that chain is.

Copilot Studio: the firm's M365 administrator is the de facto owner. In a small practice that is usually the managing partner, who did not write the prompts, did not pick the retrieval, and does not have a runbook for evals. Accountability is technically present and operationally thin.

LibreChat: whoever operates the VPS owns it. If that is an outside contractor, your verwerkersovereenkomst with them needs to specify response times and a published changelog. If you self-operate with a senior associate moonlighting as ops, accountability sits with a person who already has a day job.

Agent SDK build: accountability is whatever you write into the build. We typically codify it. A named human approves prompt changes. A named human reviews the weekly eval report. A confidence score below 0.7 routes the answer to a "please verify" banner instead of a clean response. A failed citation lookup blocks the answer entirely. None of this is exotic. It is the difference between "the model said so" and "we made a deliberate choice to trust this output".

A simple gate looks like this in the Agent SDK build:

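// Gates a dossier answer: refuses when retrieval finds nothing or a citation
// cannot be verified, and flags low-confidence drafts for human review.
// retrieve, model, SYSTEM, extractCitations, jurisprudenceDb and refusal are
// the build's own helpers, defined elsewhere.
async function answerDossierQuery(q: string) {
  // No supporting dossier text means no answer at all.
  const hits = await retrieve(q, { topK: 8, minScore: 0.62 });
  if (hits.length === 0) return refusal("no_dossier_hits");

  const draft = await model.generate({ system: SYSTEM, q, hits });
  const cites = extractCitations(draft.text);

  // Every citation must exist in the jurisprudence database, or the answer is blocked.
  for (const c of cites) {
    const ok = await jurisprudenceDb.lookup(c);
    if (!ok) return refusal("unverified_citation", { citation: c });
  }

  // Below the confidence cut-off, the text ships behind a "please verify" banner.
  if (draft.confidence < 0.7) {
    return { kind: "please_verify", text: draft.text, hits };
  }
  return { kind: "answer", text: draft.text, hits };
}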

That function is fewer than thirty lines and it carries most of the defensibility load. You cannot write its equivalent inside Copilot Studio's canvas. You can in LibreChat if you fork it. You get it for free in an Agent SDK build because you wrote it.
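The gate leans on two helpers it does not show: extractCitations and the lookup. Here is a minimal, runnable sketch of that citation check, with an in-memory set standing in for the jurisprudence database. The roll-number pattern is a deliberate simplification and the identifiers in the usage lines are made-up placeholders, not real rulings; the function and type names are ours.

```typescript
// ASSUMPTION: a simplified roll-number pattern such as "C.19.0001.N".
// Real citation formats are more varied; this regex is illustrative only.
const ROLL_NUMBER = /\b[A-Z]\.\d{2}\.\d{4}\.[NF]\b/g;

// Returns the distinct roll numbers found in a draft, in order of appearance.
function extractCitations(text: string): string[] {
  return Array.from(new Set(text.match(ROLL_NUMBER) ?? []));
}

type Verdict =
  | { kind: "answer"; text: string }
  | { kind: "refusal"; reason: "unverified_citation"; citation: string };

// Blocks the draft as soon as one citation is missing from the known set.
// The set is the source of truth; the model's text is only a draft.
function gateOnCitations(draft: string, known: ReadonlySet<string>): Verdict {
  for (const citation of extractCitations(draft)) {
    if (!known.has(citation)) {
      return { kind: "refusal", reason: "unverified_citation", citation };
    }
  }
  return { kind: "answer", text: draft };
}

// Usage with placeholder identifiers (not real rulings).
const knownRulings: ReadonlySet<string> = new Set(["C.19.0001.N"]);
const blocked = gateOnCitations("See C.19.0001.N and C.19.0002.N.", knownRulings);
const passed = gateOnCitations("See C.19.0001.N.", knownRulings);
console.log(blocked.kind, passed.kind); // refusal answer
```

Note that a draft with no citations at all passes this check, so a production gate would likely pair it with a rule about when a citation is required.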

The scoring matrix

We weight the axes 30 / 40 / 30 for legal practices. Cost matters but not most; defensibility matters most; sign-off matters because it is the only axis the regulator will ask you about by name.

For the Antwerp firm in question, the matrix scored as follows. Copilot Studio: 4 / 10 on cost, 7 / 10 on defensibility, 5 / 10 on sign-off. Weighted: 5.5. LibreChat: 9 / 10 on cost, 6 / 10 on defensibility, 6 / 10 on sign-off. Weighted: 6.9. Custom Agent SDK: 5 / 10 on cost (year one), 9 / 10 on defensibility, 9 / 10 on sign-off. Weighted: 7.8.
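The weighted totals are a plain weighted sum, which makes the matrix easy to re-run with your own scores. A short sketch using the weights and scores quoted above; the type and function names are ours.

```typescript
type Axis = "cost" | "defensibility" | "signOff";
type Scores = Record<Axis, number>;

// Weights from the text: 30 / 40 / 30.
const WEIGHTS: Scores = { cost: 0.3, defensibility: 0.4, signOff: 0.3 };
const AXES: Axis[] = ["cost", "defensibility", "signOff"];

// Weighted sum of the three axis scores, rounded to one decimal.
function weightedScore(scores: Scores): number {
  const total = AXES.reduce((sum, axis) => sum + scores[axis] * WEIGHTS[axis], 0);
  return Math.round(total * 10) / 10;
}

const copilotStudio = weightedScore({ cost: 4, defensibility: 7, signOff: 5 });
const libreChat = weightedScore({ cost: 9, defensibility: 6, signOff: 6 });
const customSdk = weightedScore({ cost: 5, defensibility: 9, signOff: 9 });

console.log(copilotStudio, libreChat, customSdk); // 5.5 6.9 7.8
```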

The firm picked the custom build. Two paralegals and one partner trained on it for a week. The agent now handles the bulk of weekly queries without escalation. The rest route to the "please verify" banner, and the human handles those in a fraction of the time a fresh dossier read would take.

Factors that flip the matrix

If the firm were sixty seats instead of twelve, Copilot Studio would close the cost gap because the tenant fees amortise. If the practice area were tax instead of litigation, LibreChat with a strict citation tool would be enough because the corpus is narrower and hallucinations are easier to gate. If the firm did no legal work at all and just needed a smart inbox triager, none of this would apply and we would have shipped a one-week prototype on whatever stack the IT lead already understood. The point of the rubric is to make those tradeoffs visible before the contracts are signed, not after.

The regulatory floor under all frontier models is also moving. Vendor terms, EU AI Act timelines, and bar-association guidance shift quarterly, and the trade press this month is full of stories about which models will be allowed where. Bake a re-score into the engagement. We re-run the matrix every six months for clients in regulated practice and adjust the build, not the rhetoric.

The five-minute version for your firm

If you run a small legal, accountancy, or notary practice and you are about to sign a chat-agent contract this quarter, do this before you do anything else. Open a blank document. Write down three numbers: how many queries per week your team will run, how many euros per seat per month you are willing to spend, and how many hours of partner time per quarter you will spend reviewing audit logs. Then ask the vendor to put their answer to each in writing. If they cannot, you are buying a demo, not a deployment.

When we built the paralegal-facing agent for an Antwerp litigation firm earlier this year, the piece that mattered most was not the model but the citation lookup tool, which turned a confident hallucination into a blocked response. We got there by treating the citation database as the source of truth and the model as a draftsman, and we now wire that same pattern into every AI agent we ship for regulated practices.

Key takeaway

Score chat-agent platforms for a small legal firm on three axes: per-seat cost, beroepsgeheim defensibility, and who signs off when a hallucination reaches a brief.

FAQ

How long does a custom Claude Agent SDK build take for a small legal firm?

60 to 120 engineering hours for a defensible first version: retrieval, citation enforcement, confidence-gated refusals, audit logging. Two to four weeks of calendar time including evals.

Can we host LibreChat on-premise instead of a VPS?

Yes. A small server in the firm's own rack handles 1,500 queries per month easily. You need a named person responsible for backups, security patches, and TLS certificate rotation.

Does the EU AI Act change which platform we should pick?

Not directly. All three can be made compliant. What changes is the documentation burden, which falls heavier on the firm than the vendor. Plan staff time for that, not just spend.

What if the firm already has a Microsoft 365 tenant?

Copilot Studio gets cheaper at the margin because licensing is already done. It does not become more defensible. Cost and defensibility are separate axes; do not let one absorb the other.
