LangGraph vs CrewAI vs Workers: choosing an agent stack
Three agent frameworks, one Antwerp legal firm, one 02:00 incident. We scored LangGraph, CrewAI, and a hand-rolled Workers state machine on what actually mattered.

02:14, Tuesday. A senior associate at a 19-person legal-research firm off the Meir in Antwerp uploads a 47-page SPA. The contract-summary agent returns a clean executive summary, three risk flags, and a list of cross-references. Then it returns the same three risk flags. Then it stops responding.
The on-call engineer (us) opens the dashboard. The graph shows a node that succeeded but produced no output. A retry loop ate 38,000 tokens in 90 seconds. Nobody knows which version of the clause-classification prompt ran, because the prompt lives inside a Python decorator three files deep, and the partner who tweaked it last week did so via a Notion doc that an engineer then transcribed.
This is the post we wish we had read six months earlier, before we picked an orchestration layer. We rebuilt the same multi-step contract-summary agent three times: once on LangGraph, once on CrewAI, once as a hand-rolled state machine on Cloudflare Workers. Same inputs, same Claude model behind it, same evaluation set of 120 Dutch and English contracts. We scored each on three things that mattered at scale: can you debug it at 02:00, what does a single run cost in tokens, and who owns the prompt diff.
The pipeline we were orchestrating
The agent has five steps. Extract metadata (parties, dates, jurisdiction). Segment the contract into clauses. Classify each clause against a 34-category taxonomy the firm built over three years. Generate clause-level summaries. Roll those up into an executive summary plus a risk register. Each step is a separate LLM call. Some steps fan out (clause classification runs in parallel batches of 8). The whole pipeline takes 90 to 180 seconds end to end and costs somewhere between EUR 0.40 and EUR 1.20 per contract depending on length.
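The fan-out step can be sketched as follows. This is a minimal illustration, not the firm's code: it assumes "parallel batches of 8" means up to eight classification calls in flight at once, one batch after another, and the `Clause` type and the `classify` callback are hypothetical stand-ins for the real LLM call.

```typescript
// Hypothetical shapes for illustration only.
type Clause = { id: number; text: string };
type Classified = Clause & { category: string };

// Split an array into consecutive chunks of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error("size must be >= 1");
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Classify every clause, one batch at a time, with up to `batchSize`
// calls running concurrently inside each batch. Input order is preserved.
async function classifyAll(
  clauses: Clause[],
  classify: (c: Clause) => Promise<string>, // assumed wrapper around one LLM call
  batchSize = 8,
): Promise<Classified[]> {
  const results: Classified[] = [];
  for (const batch of chunk(clauses, batchSize)) {
    const classified = await Promise.all(
      batch.map(async (c) => ({ ...c, category: await classify(c) })),
    );
    results.push(...classified);
  }
  return results;
}
```

Because `Promise.all` rejects as soon as one call fails, a real pipeline would probably want per-clause error handling so that one bad clause does not discard the other seven results in its batch.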
The evaluation set of 120 contracts was hand-labelled by two partners over four weekends, which means a regression on it costs more than engineering time. Treating that set as load-bearing changed every architectural decision that followed. If we could not measure a change against it within a single nightly cycle, the change did not ship.
This shape, small graph plus fan-out plus retries plus partial failures, is exactly what every agent framework claims to do well. So we let them all prove it.
Debuggability at 02:00
Frameworks die in production at 02:00. That is when the on-call story matters. We graded each option by what it took to answer one question: which prompt ran, with which inputs, and what came back?
LangGraph ships with hooks for LangSmith, which is the easiest path to traces. With LangSmith on, every node logs inputs, outputs, and the resolved prompt. Without it, you are reading Python stack traces and guessing which graph edge fired. The 02:00 experience is good if you pay for LangSmith and good engineers built the graph. It degrades quickly if the graph contains conditional edges built with lambdas, because the graph viewer cannot show you which branch was taken without replaying the run.
CrewAI is opinionated about agents and tasks, less so about observability. The default logs are chatty and human-readable, which feels nice until you try to grep them at 02:00 and realise the agent-to-agent handoffs are interleaved across stdout. You can plug in OpenTelemetry, but the integration is community-maintained and the spans are coarse. The mental model (agents with roles, talking to each other) makes the logs read like a script rather than a trace, which is charming in a demo and painful in an incident.
Hand-rolled on Cloudflare Workers means you wrote the state machine yourself, probably with Durable Objects or Cloudflare Workflows holding state between steps. No framework hands you anything, and nothing runs that you did not put there yourself. We logged every state transition as a structured event into a single Workers Analytics Engine dataset, with the resolved prompt hash, the input token count, and the output. At 02:00, one SQL query tells us which prompt version ran. The catch is that you had to build this on day one. If you did not, you are in worse shape than either framework.
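A structured transition event of that kind might look like the sketch below. The field names, the 16-character hash length, and the `transitionEvent` helper are assumptions for illustration, not the schema we shipped. The hash uses `node:crypto`, which in Workers requires the Node.js compatibility flag; Web Crypto would work as well.

```typescript
import { createHash } from "node:crypto";

// Hypothetical event shape: one record per state transition.
interface TransitionEvent {
  runId: string;
  contractId: string;
  step: string;
  promptHash: string; // hash of the fully resolved prompt text
  inputTokens: number;
  output: string;
  at: string; // ISO 8601 timestamp
}

// Stable short fingerprint of a resolved prompt (first 16 hex chars of SHA-256).
function promptHash(resolvedPrompt: string): string {
  return createHash("sha256").update(resolvedPrompt, "utf8").digest("hex").slice(0, 16);
}

// Build the event for one transition. `now` is injectable to keep this testable.
function transitionEvent(
  runId: string,
  contractId: string,
  step: string,
  resolvedPrompt: string,
  inputTokens: number,
  output: string,
  now: Date = new Date(),
): TransitionEvent {
  return {
    runId,
    contractId,
    step,
    promptHash: promptHash(resolvedPrompt),
    inputTokens,
    output,
    at: now.toISOString(),
  };
}
```

Each event would then be written to whatever sink you use (an Analytics Engine dataset, or simply `console.log(JSON.stringify(event))`), so that one line corresponds to exactly one transition.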
If your orchestrator cannot tell you when a step started returning shorter or hedged answers, you will not know a model has silently degraded until a partner complains. That is not a framework problem; it is the absence of a logging contract you should have written on day one.
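A check for the "shorter answers" half of that can be as small as comparing medians. The sketch below is an assumption about how such a check could look: the 25 percent tolerance is an arbitrary illustrative threshold, and detecting hedged answers would need a separate signal that this does not attempt.

```typescript
// Median of a non-empty list of numbers (does not mutate the input).
function median(xs: number[]): number {
  if (xs.length === 0) throw new Error("median of empty list");
  const sorted = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// True when the recent median output length for a step has fallen more than
// `tolerance` (as a fraction) below the baseline median for the same step.
function outputShrank(
  baselineOutputTokens: number[],
  recentOutputTokens: number[],
  tolerance = 0.25, // illustrative threshold, not a recommendation
): boolean {
  return median(recentOutputTokens) < median(baselineOutputTokens) * (1 - tolerance);
}
```

The inputs are per-step output token counts pulled from the transition log, which is why the logging contract has to exist before a check like this can.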
Token spend per run
We ran the same 120-contract evaluation set through all three orchestrators, using the same Claude model and the same prompts (modulo framework-injected scaffolding, which is the whole point of measuring this). Median tokens per contract:
- LangGraph: 38,400 tokens in, 6,200 tokens out
- CrewAI: 51,700 tokens in, 9,800 tokens out
- Hand-rolled: 29,100 tokens in, 5,900 tokens out
LangGraph's overhead comes from the system messages it injects to enforce structured outputs, plus slightly bloated tool definitions. Tolerable. CrewAI's overhead is the agent-to-agent chatter. When two agents collaborate, they exchange context, and that context is tokens. On a clean run with five steps, CrewAI was 32 percent more expensive than the hand-rolled version. On a run with one retry, it was 58 percent more. At 200 contracts a day, that is the difference between EUR 90 and EUR 145 per day in inference. Multiply by 220 working days a year.
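The multiplication is worth doing explicitly. The sketch below uses only the figures from this section (200 contracts a day, EUR 90 versus EUR 145 a day, 220 working days); the function names are illustrative.

```typescript
const CONTRACTS_PER_DAY = 200;
const WORKING_DAYS_PER_YEAR = 220;

// Average inference cost per contract implied by a daily bill.
function perContractEur(dailyEur: number, contractsPerDay: number): number {
  return dailyEur / contractsPerDay;
}

// Yearly gap between two daily inference bills.
function annualDifferenceEur(cheaperDailyEur: number, dearerDailyEur: number, days: number): number {
  return (dearerDailyEur - cheaperDailyEur) * days;
}

const gap = annualDifferenceEur(90, 145, WORKING_DAYS_PER_YEAR);
console.log(perContractEur(90, CONTRACTS_PER_DAY));  // 0.45 EUR per contract
console.log(perContractEur(145, CONTRACTS_PER_DAY)); // 0.725 EUR per contract
console.log(gap);                                    // 12100 EUR per year
```

So the gap is EUR 55 a day, or EUR 12,100 a year at this volume, and both per-contract figures sit inside the EUR 0.40 to EUR 1.20 range quoted for the pipeline.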
The hand-rolled version is cheapest because nothing is injected you did not ask for. It is also the version most likely to silently skip a guardrail you forgot to add, which is the other side of the same coin.
Prompt diff ownership
This is the question every comparison post skips, and it is the one that decided the project.
The legal firm's senior partner does not write Python. She does, however, have strong opinions about how a force-majeure clause should be summarised, and she is right about them. The question is whether she can change that prompt without filing a ticket.
With LangGraph, prompts typically live as Python string constants or PromptTemplate objects inside node functions. Refactoring them out into a config file is possible and we have done it, but it is work. Until you do that work, every prompt change is a Python PR.
With CrewAI, agent backstories and task descriptions are arguments to Agent and Task constructors. Same story. You can externalise them into YAML (newer CrewAI versions encourage this), but the round trip (partner edits YAML, engineer reviews the PR, CI runs, deploy) still gates every change on engineering.
With the hand-rolled Workers version, we put every prompt into a versioned table in a D1 database with a tiny admin UI. The partner edits a prompt, hits save, the next contract run uses the new version. Every run records the prompt version hash. If the new prompt regresses on the 120-contract eval set (which runs nightly), the system pages us and reverts. The partner owns the diff. Engineering owns the rails.
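The save, record, and revert loop can be sketched in memory. Everything here is illustrative: the `PromptRegistry` class stands in for the D1 table, and the `nightlyGate` function with its 0.02 score tolerance is an assumed policy, not the one we run.

```typescript
import { createHash } from "node:crypto";

interface PromptVersion {
  name: string;
  version: number; // 1-based, append-only
  text: string;
  hash: string; // recorded on every run that uses this version
}

// In-memory stand-in for a versioned prompt table.
class PromptRegistry {
  private history = new Map<string, PromptVersion[]>();
  private activeIndex = new Map<string, number>();

  // Append a new version and make it the active one.
  save(name: string, text: string): PromptVersion {
    const list = this.history.get(name) ?? [];
    const hash = createHash("sha256").update(text, "utf8").digest("hex").slice(0, 16);
    const v: PromptVersion = { name, version: list.length + 1, text, hash };
    list.push(v);
    this.history.set(name, list);
    this.activeIndex.set(name, list.length - 1);
    return v;
  }

  // The version the next run will use.
  current(name: string): PromptVersion {
    const list = this.history.get(name);
    const i = this.activeIndex.get(name);
    if (!list || i === undefined) throw new Error(`unknown prompt: ${name}`);
    return list[i];
  }

  // Step the active pointer back one version; history is never deleted.
  revert(name: string): PromptVersion {
    const i = this.activeIndex.get(name);
    if (i === undefined || i === 0) throw new Error(`nothing to revert for: ${name}`);
    this.activeIndex.set(name, i - 1);
    return this.current(name);
  }
}

// Revert when the eval score drops by more than `maxDrop` against the baseline.
function nightlyGate(
  registry: PromptRegistry,
  name: string,
  baselineScore: number,
  candidateScore: number,
  maxDrop = 0.02, // assumed tolerance
): "kept" | "reverted" {
  if (candidateScore < baselineScore - maxDrop) {
    registry.revert(name);
    return "reverted";
  }
  return "kept";
}
```

Keeping the history append-only means a revert is just a pointer move, so every past run's recorded hash still resolves to the exact text that ran.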
Prompt-diff ownership is an org-chart question disguised as a tooling question. Pick the stack that puts the change in the right hands, then build the rails to keep that change safe.
The liability question
For a legal-research firm, AI hallucinations are not a technical curiosity. They are an exposure. If the contract-summary agent invents a clause that does not exist, and a partner relies on that summary in a negotiation, the firm is on the hook. Not the framework vendor. Professional indemnity insurers are starting to ask about this in writing.
The European conversation around generative-AI liability is converging on the same point: the deployer is responsible. The EU AI Act classes some AI systems that assist in researching and interpreting the law as high-risk, with the corresponding obligations on logging, human oversight, and post-market monitoring. That is not a 2028 problem. The load-bearing articles bite earlier, and your client's general counsel is already reading them.
That pushed two changes in our architecture, independent of which orchestrator we picked. First, every clause summary is required to cite the source span (start and end character offsets), and the UI renders the summary next to the original text. If the source span is missing, the summary is suppressed. Second, the executive summary is generated only from clause summaries, never from the raw contract. The model never gets to remember the contract. It can only assemble verified pieces.
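The first rule, no source span means no summary, can be sketched as a plain filter that sits between steps. The `ClauseSummary` shape and the function names are hypothetical; the point is that the check is ordinary code and not a prompt instruction.

```typescript
// Hypothetical shape of one clause-level summary.
interface ClauseSummary {
  clauseId: number;
  summary: string;
  span?: { start: number; end: number }; // character offsets into the contract
}

// A span is usable only if it is a non-empty integer range inside the contract text.
function hasValidSpan(s: ClauseSummary, contractLength: number): boolean {
  const span = s.span;
  if (!span) return false;
  return (
    Number.isInteger(span.start) &&
    Number.isInteger(span.end) &&
    span.start >= 0 &&
    span.start < span.end &&
    span.end <= contractLength
  );
}

// Keep only summaries with a valid span and attach the cited source text,
// so a UI can render the summary next to the original wording.
function renderable(summaries: ClauseSummary[], contractText: string) {
  return summaries
    .filter((s) => hasValidSpan(s, contractText.length))
    .map((s) => ({
      ...s,
      sourceText: contractText.slice(s.span!.start, s.span!.end),
    }));
}
```

A valid span only proves the offsets point somewhere inside the contract. It does not prove the summary is faithful to that text, which is why the side-by-side rendering and human review still matter.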
You can build this in any of the three frameworks. It is easier to enforce in the hand-rolled version because you control the data flow between steps. In LangGraph you have to write the validator as a node. In CrewAI you have to teach an agent to enforce it, which is asking a model to police a model.
What we shipped
The hand-rolled state machine on Cloudflare Workers. Not because it is fashionable, and not because we have a problem with frameworks. The deciding factors, in order:
- The senior partner owns the prompt diff. That alone was worth the extra build cost.
- One structured log line per state transition gives us 02:00 traces without paying for a separate observability vendor.
- Median token spend (input plus output, from the medians above) is 22 percent lower than LangGraph and 43 percent lower than CrewAI on the same evaluation set.
- The whole orchestrator (state machine, retry policy, prompt registry, eval runner) is about 1,200 lines of TypeScript. A new engineer reads it in an afternoon.
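To give a sense of how little code the core needs, here is a sketch of a transition function for the five pipeline steps with a capped retry count. The step names, the `State` shape, and the limit of three attempts are assumptions for illustration, not our production values.

```typescript
// The five pipeline steps, in order. Names are illustrative.
type Step = "extract" | "segment" | "classify" | "summarise" | "rollup";

type State =
  | { step: Step; attempt: number }
  | { step: "done" }
  | { step: "failed"; at: Step };

const ORDER: Step[] = ["extract", "segment", "classify", "summarise", "rollup"];

// Pure transition function: given the current state and the outcome of the
// step that just ran, return the next state. Retries are bounded, so a
// failing step ends in "failed" and cannot loop forever.
function next(state: State, outcome: "ok" | "error", maxAttempts = 3): State {
  if (state.step === "done" || state.step === "failed") return state; // terminal
  if (outcome === "error") {
    return state.attempt >= maxAttempts
      ? { step: "failed", at: state.step }
      : { step: state.step, attempt: state.attempt + 1 };
  }
  const i = ORDER.indexOf(state.step);
  return i === ORDER.length - 1
    ? { step: "done" }
    : { step: ORDER[i + 1], attempt: 1 };
}
```

Because `next` is pure, every transition can be logged and replayed, and the retry cap is the guardrail that bounds how many tokens a stuck step can burn.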
The trade-off is real. The frameworks give you a head start. LangGraph in particular is excellent if your team already lives in the LangChain ecosystem and you do not need partners to edit prompts; its checkpointing primitives are also useful when you have human-in-the-loop pauses that span hours. CrewAI is genuinely lovely for prototyping when you want a multi-agent feel and do not care about token economics. If you are at three contracts a day, pick whichever your engineers already know and ship.
At 200 contracts a day with a partner who wants a say, the math changes.
The five-minute audit
Open the file where your agent's prompts live. Time how long it takes to find the exact prompt that ran in production yesterday at 14:32 against contract ID 8821. If you can do it in under sixty seconds, your stack is fine. If it takes longer, the orchestrator question is downstream of a logging question. Fix that first. Once the trace is honest, the framework debate gets cheap: you can measure what each option actually costs you, instead of arguing about it.
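If every run records its prompt hash, the audit reduces to a filter over the log. The sketch below assumes a hypothetical `RunRecord` shape with ISO timestamps; in practice the same lookup would be a single query against wherever your transition events live.

```typescript
// Hypothetical log record: one per step execution.
interface RunRecord {
  contractId: string;
  step: string;
  promptHash: string;
  at: string; // ISO 8601 timestamp
}

// Return the records for one contract inside a time window, oldest first.
function promptsThatRan(
  records: RunRecord[],
  contractId: string,
  fromIso: string,
  toIso: string,
): RunRecord[] {
  const from = Date.parse(fromIso);
  const to = Date.parse(toIso);
  return records
    .filter((r) => r.contractId === contractId)
    .filter((r) => {
      const t = Date.parse(r.at);
      return t >= from && t <= to;
    })
    .sort((a, b) => Date.parse(a.at) - Date.parse(b.at));
}
```

For the example above, the call would be `promptsThatRan(log, "8821", start, end)` with a window of a few minutes around 14:32, and each returned `promptHash` resolves to one exact prompt version.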
When we rebuilt the contract-summary agent for this Antwerp firm, the core tension was that the partner needed to own the prompts while engineering needed to own the rails. We solved it by versioning every prompt in a D1 table and running the 120-contract eval set nightly, so a prompt regression pages us within twelve hours instead of three weeks.
Key takeaway
Pick the orchestrator that puts the prompt diff in the right hands, then build the rails (versioning, eval set, auto-revert) that keep that ownership safe.
FAQ
Should I always pick a hand-rolled orchestrator over LangGraph or CrewAI?
No. Under ten agent runs a day, the framework's head start is worth more than the token savings. Pick LangGraph or CrewAI, ship, and revisit the question when volume justifies it.
What about LangGraph's checkpointing for resuming long-running runs?
Useful when you have human-in-the-loop steps that pause for hours. We did not need it. Durable Objects on Cloudflare gave us equivalent state persistence with fewer moving parts.
Can a non-engineer really edit production prompts safely?
Only with rails. A nightly eval set against a frozen corpus plus automatic revert on regression makes it safe. Without those rails, the answer is no, no matter which orchestrator you use.
How long did the rebuild take end to end?
Six weeks of engineering for the orchestrator, prompt registry, eval runner, and admin UI. Three more weeks to migrate the firm's existing prompts and train partners on the editor.