
Agency stack scorecard: agent fleet vs Next.js vs Retool

Three client briefs on the kitchen table, six devs, one junior starting in month 2. A three-axis method for picking the stack per project, not per agency.

Jacob Molkenboer · Founder · A Brand New Company · 14 Jan 2026 · 6 min


It is a Thursday evening in May. You run a Utrecht agency: six devs, €4.2M revenue over the trailing twelve months. Three client briefs are on your kitchen table. A 3PL wants an invoice-chaser. A contractor wants an internal portal that talks to AFAS. A private-label brand wants its marketing-ops rebuilt because its current Zapier graph has 137 zaps and nobody can explain it. All three need to ship before Q4. Two devs are on holiday in week 3. The junior starts in month 2. And you have to decide, tonight, what each project ships on.

Since the latest round of HN threads pushing “AI-native” as the default for new builds, half of your inbox is asking whether everything should now ship as a Claude Code agent fleet. The other half still builds everything in Retool. Neither view is wrong. Both are useless without a method.

This is the method we use at ABN to pick the stack per project, not per agency. It came out of fourteen agents in production and a handful of rescues from teams who shipped the wrong stack on the wrong brief.

The three honest descriptions

Strip the marketing. A Claude Code agent-fleet is a Git repo full of agent specs, a dispatcher, a tools layer (your APIs, your DB, your filesystem), and an orchestrator. The agent does work humans used to do — triage, chase, classify, draft. It loops. It calls tools. It writes back. Anthropic's Claude Code docs describe the runtime, but the operational model is yours to invent. Cheap to start. Expensive to debug at 11pm on a Friday.

A hand-rolled Next.js + Postgres app is what your dev team already knows. App Router, Drizzle or Prisma, Postgres on Neon or Supabase, a queue if you need one, deploy on Vercel or Fly. Deterministic. Boring. You own every line. The bus factor matches your headcount.

A Retool-on-steroids setup means an internal-tool platform (Retool, Budibase, ToolJet) as the front, your APIs and DB as the back, and a workflow runner attached. Closed-source UI, open-ish glue. Cheap per screen, expensive per surprise. The Retool docs are honest about where the platform ends and code begins.

All three can ship the same brief. The wrong one will eat your margin.

The three axes that decide it

We score every project on the same three numbers. Each is a 1–10. Each comes from a specific question, not a vibe.

Per-project margin across six devs

Not “is it cheap to build.” The right question is: how many of our six devs can ship a feature in this stack in week 2 without a senior pair? Six out of six is a 10. One out of six (your one senior) is a 1. Stacks that only your best dev can ship in are silently subsidised by that dev's salary across every other project they don't touch. That is where margin leaks out without you noticing.

Client-handover defensibility

The brief: a junior leaves in month 4. Their replacement starts in month 5. Can the replacement extend the system in their first sprint without your senior babysitting? A 10 means yes, with public docs. A 1 means the system has bespoke tribal patterns and unwritten orchestration rules. This is the axis where agent-fleets quietly lose points unless you have written your own runbook layer. The framework is new enough that there isn't yet a Stack Overflow you can hand a junior.

Runbook ownership on a Friday afternoon

It is 16:45 on a Friday. The agent is looping. It has called send_invoice seven times for the same debtor. The client calls. Who answers the phone, who has the kill switch, who reads the logs? If the answer is “we figure it out,” the project does not ship on that stack. Score 1. If the answer is one named human with a pinned runbook, a tested kill switch, and a known SLA, score 10. This axis is non-negotiable for any stack that runs autonomously between deploys, which both the agent-fleet and the Retool-workflow options do.

Warning

If you cannot name the runbook owner before the project starts, the stack does not matter. The project will fail on every stack equally.
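
The gate can be written down as a checklist before anyone opens the scoring sheet. The following is a minimal sketch, not our intake tooling: the `RunbookPlan` fields and the `runbook_gate` function are hypothetical names, and treating non-autonomous stacks as an automatic pass is an assumption taken from how the deterministic option is marked in the sample sheet.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunbookPlan:
    """Hypothetical intake record for the runbook gate."""
    owner: Optional[str]        # one named human, not a team alias
    runbook_pinned: bool        # the runbook exists and is easy to find
    kill_switch_tested: bool    # tested in production, not in slides
    sla_agreed: bool            # the client knows the response time


def missing_items(plan: RunbookPlan) -> List[str]:
    """List what is still missing before the gate can pass."""
    missing = []
    if not (plan.owner and plan.owner.strip()):
        missing.append("named owner")
    if not plan.runbook_pinned:
        missing.append("pinned runbook")
    if not plan.kill_switch_tested:
        missing.append("tested kill switch")
    if not plan.sla_agreed:
        missing.append("known SLA")
    return missing


def runbook_gate(plan: RunbookPlan, runs_autonomously: bool) -> str:
    """Return 'pass' or 'BLOCKED' for one stack on one project."""
    if not runs_autonomously:
        # Assumption: a stack with no autonomy between deploys passes.
        return "pass"
    return "pass" if not missing_items(plan) else "BLOCKED"


plan = RunbookPlan(owner=None, runbook_pinned=True,
                   kill_switch_tested=False, sla_agreed=False)
print(runbook_gate(plan, runs_autonomously=True), missing_items(plan))
```

The list of missing items is the useful part in practice: it turns "we figure it out" into concrete things to settle before kickoff.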

The scoring sheet

We keep this in a YAML file in our intake repo. One file per project, one minute per axis, three minutes to a decision. The file keeps our Dutch key names: marge is margin, klant is client. Weights are not equal. Margin is the loudest number. Defensibility is a tax that compounds across years. Runbook ownership is a gate, not a score: it either passes or blocks the stack.

# intake/scorecard.yaml
project: logistics-invoice-chaser
client: 3pl-rotterdam
deadline: 2026-09-30

axes:
  marge_across_6_devs:
    next_js: 8       # whole team ships
    agent_fleet: 5   # one senior + one mid
    retool: 9        # any dev, any junior
    weight: 0.5

  klant_handover_defensibility:
    next_js: 8       # standard stack, public docs
    agent_fleet: 4   # tribal, no junior pipeline
    retool: 6        # vendor-locked UI
    weight: 0.3

  runbook_ownership:
    next_js: pass    # deterministic, no autonomy
    agent_fleet: BLOCKED  # no named owner yet
    retool: pass     # human-in-the-loop
    weight: gate

decision:
  formula: 0.5*marge + 0.3*defensibility, gate on runbook
  next_js: 0.5*8 + 0.3*8 = 6.4
  agent_fleet: BLOCKED
  retool: 0.5*9 + 0.3*6 = 6.3
verdict: next_js   # marge edge, no autonomy needed for v1

The formula matters less than the gate. Almost every disaster we have cleaned up was an agent-fleet that shipped without a named runbook owner. The kill switch existed in slides. It did not exist in production. By the time someone read the loop logs, the third invoice had landed in the inbox of the client's debtor and the relationship was on fire.
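
For readers who want to check the arithmetic, here is a small sketch of the decision step. The numbers are copied from the sample sheet; the `decide` function and the dictionary layout are illustrative assumptions, not how our intake repo is wired.

```python
# Weights and scores copied from the sample scorecard.
WEIGHTS = {"marge": 0.5, "defensibility": 0.3}

SCORES = {
    "next_js":     {"marge": 8, "defensibility": 8, "runbook": "pass"},
    "agent_fleet": {"marge": 5, "defensibility": 4, "runbook": "BLOCKED"},
    "retool":      {"marge": 9, "defensibility": 6, "runbook": "pass"},
}


def decide(scores, weights=WEIGHTS):
    """Apply the runbook gate first, then rank the survivors by weighted score."""
    totals = {}
    for stack, axes in scores.items():
        if axes["runbook"] != "pass":
            totals[stack] = None  # gated out, never ranked
            continue
        weighted = sum(weight * axes[axis] for axis, weight in weights.items())
        totals[stack] = round(weighted, 2)
    eligible = {stack: total for stack, total in totals.items() if total is not None}
    verdict = max(eligible, key=eligible.get) if eligible else None
    return totals, verdict


totals, verdict = decide(SCORES)
print(totals)   # {'next_js': 6.4, 'agent_fleet': None, 'retool': 6.3}
print(verdict)  # next_js
```

Note that the gate runs before any weighting, so a blocked stack cannot win on points no matter how high its other scores are.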

How the three briefs scored

Run the sheet against the kitchen-table briefs.

The 3PL invoice-chaser looks like an agent-fleet poster child, and on Reddit it would ship that way. On the scorecard it failed the runbook gate. Six devs, no named ops lead, no kill-switch infra, no SLA conversation with the client. We shipped it as Next.js + Postgres, with a “draft, human approves, send” loop. Boring. Margin held. We added the agent layer in v2 once the ops handoff was real.
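
The “draft, human approves, send” loop is essentially a three-state machine in which sending is impossible without an approval. The sketch below illustrates that idea in Python for brevity; the production version lived in the Next.js app, and every name here (`Reminder`, `approve`, `send`, the `deliver` callback) is hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Status(Enum):
    DRAFT = "draft"
    APPROVED = "approved"
    SENT = "sent"


@dataclass
class Reminder:
    debtor: str
    body: str
    status: Status = Status.DRAFT
    approved_by: Optional[str] = None


def approve(reminder: Reminder, reviewer: str) -> None:
    """A human signs off on a draft. Anything else is rejected."""
    if reminder.status is not Status.DRAFT:
        raise ValueError("only drafts can be approved")
    reminder.status = Status.APPROVED
    reminder.approved_by = reviewer


def send(reminder: Reminder, deliver: Callable[[str, str], None]) -> None:
    """Deliver an approved reminder exactly once."""
    if reminder.status is not Status.APPROVED:
        raise PermissionError("only approved reminders can be sent")
    deliver(reminder.debtor, reminder.body)
    reminder.status = Status.SENT


outbox = []
reminder = Reminder(debtor="debtor-001", body="Invoice 1042 is overdue.")
approve(reminder, reviewer="ops@agency.example")
send(reminder, deliver=lambda to, body: outbox.append((to, body)))
```

Because a sent reminder is no longer in the approved state, a second `send` call raises instead of mailing the debtor again, which is the failure the runbook axis is worried about.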

The contractor portal with AFAS got Retool. Five internal users, twenty screens, no autonomy, and the client's own IT team already used Retool. Defensibility was 9 because the client could maintain it. Margin was 9 because two of our juniors built the bulk. The AFAS adapter went into a tiny Next.js sidecar under our maintenance contract, which is the only seam where Retool stops being honest about itself.

The marketing-ops rebuild got the agent-fleet, but only because we negotiated a named runbook owner on the client's side, charged a small premium for a 24/5 on-call rotation, and shipped the Zapier replacement as a Claude Code orchestrator with hard limits on tool calls per hour. The runbook lives in their Notion, owned by their head of growth, with a kill-switch endpoint that disables the agent and triggers a Slack page.
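
A hard limit on tool calls per hour, combined with a kill switch, can be sketched as a sliding-window counter that every tool call must pass through. This is an illustration under assumptions, not the orchestrator we shipped: `ToolCallLimiter` is a hypothetical class, it is not part of Claude Code, and the Slack page is left out.

```python
import time
from collections import deque


class ToolCallLimiter:
    """Sliding one-hour window over tool calls, plus a manual kill switch."""

    WINDOW_SECONDS = 3600

    def __init__(self, max_calls_per_hour, clock=time.monotonic):
        self.max_calls = max_calls_per_hour
        self.clock = clock      # injectable so the limiter can be tested
        self.calls = deque()    # timestamps of allowed calls, oldest first
        self.killed = False

    def kill(self):
        """What a kill-switch endpoint would call: refuse everything from now on."""
        self.killed = True

    def allow(self):
        """Return True and record the call if the agent may use a tool now."""
        if self.killed:
            return False
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.WINDOW_SECONDS:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            return False
        self.calls.append(now)
        return True


limiter = ToolCallLimiter(max_calls_per_hour=2)
for attempt in range(3):
    print(attempt, limiter.allow())  # the third attempt is refused
```

The design choice worth copying is that the limit sits outside the agent loop: the agent cannot talk its way past a counter it does not control.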

When the method punts

The scorecard is honest about what it cannot price. It does not price client prestige (some projects you ship at thin margin because of the logo). It does not price your team's appetite: if your three best devs will quit on you when you ship another Retool build, score that. It does not price strategic learning. Agent-fleets in 2026 are still a moat builder for an agency that wants to hire ahead of the market, and a 0.4-point loss on margin can be worth it for the practice.

What the method does do is keep the conversation honest. You can disagree with the score, but you have to disagree with a number, not a feeling.

If you do one thing today, run the three-axis check on whatever project is closest to start. Open a YAML file. Score the three stacks across margin, defensibility, runbook. Notice which one is blocked. When we built the marketing-ops agent for that private-label brand, the thing we almost missed was the runbook gate; we ended up solving it by writing the kill-switch endpoint into the client contract before the kickoff. That kind of plumbing is what we work on: AI agents for sub-€8M agencies and their clients.

Key takeaway

Pick the stack per project, not per agency: score every brief on margin across six devs, junior-handover defensibility, and who owns the Friday-afternoon runbook.

FAQ

Why score per project instead of standardising the agency on one stack?

Because the three axes shift per brief. A portal with no autonomy and a Retool-literate client is a different math problem from an autonomous invoice-chaser with no named ops owner.

Why is runbook ownership a gate and not a weighted score?

Because an autonomous system without a named owner does not fail gracefully. It fails on a Friday afternoon and burns a client relationship. A gate forces that conversation before kickoff.

Where do agent-fleets actually win on the scorecard?

On projects with a named client-side runbook owner, a real ops budget, and work that humans currently do at scale. The margin upside shows up in v2 once the orchestration layer is reused.

Does the scorecard work for solo founders or only for six-dev agencies?

It scales down. Replace the margin axis with hours-per-week-you-can-afford-to-maintain-it. The defensibility and runbook axes stay the same; both still bite a solo founder, often harder.
