AI agent onboarding: keep your team's judgement sharp

A Berkeley CS classroom is not your operations team. But the warning is the same. Roll out an agent the wrong way and your sharpest people get duller.

Jacob Molkenboer · Founder · A Brand New Company · 9 Apr 2024 · 6 min

A Tuesday morning at a twelve-person logistics firm in Rotterdam. The operations lead is sitting next to her newest hire, reviewing twelve customer emails an agent drafted overnight. The lead approves eight. The new hire would have approved twelve. That gap is the whole game.

That gap is also what the Berkeley CS grades story going round this week is really about. Failing grades climbed alongside AI usage. Mathematicians sent up a parallel flare earlier this year. You can argue the methodology, but the observation underneath is hard to dodge. When a tool produces a passable answer, people stop building the judgement that tells passable from right. The same shape shows up in the Stanford HAI AI Index: usage curves running well ahead of the skill curves they were supposed to lift.

Why this is an SME problem, not a classroom problem

In a Berkeley course, the cost of a student leaning on an AI shows up two years later in a job they may or may not get. Delayed and diffuse. In a twelve-person operations team it shows up in three weeks. A customer asks something the agent did not catch. The teammate who should have caught it does not, because they stopped reading carefully on day nine. Now the customer is annoyed and your operations lead is fixing it on a Saturday.

What you are training for, then, is not "how to use the agent". It is how to stay sharper than the agent on the parts that matter. The brief is small, and that is the point.

The four-week rollout we actually run

When we put an agent inside a small team, the rollout runs four weeks. Not because four weeks is magic, but because it is enough to see two full cycles of the work and two cycles of how the team reacts to it.

Week one is read-only. The agent runs in the background, drafts everything, ships nothing. The team reviews drafts at end of day. We collect every override and we read them together on Friday. The point is not the agent yet. The point is to surface what the team's judgement actually looks like before it gets blunted, so we have a baseline to compare to.

Week two is one-eye-on. The agent ships drafts to a staging queue. A teammate approves or rewrites each one before it leaves. We measure approval rate, override rate, and time-per-item. The numbers themselves are less interesting than how they move from Monday to Friday. A flat override rate is the warning sign, not a high one.
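A minimal sketch of the week-two numbers, assuming each reviewed draft is logged with the day, whether it went out unchanged, and how long the review took. The `ReviewedDraft` record, the field names, and the 0.05 flatness tolerance are illustrative assumptions, not part of any particular tool. Approval rate is simply one minus the override rate.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ReviewedDraft:
    day: str               # e.g. "Mon" (hypothetical log field)
    approved_as_is: bool   # False means the reviewer rewrote or rejected the draft
    review_seconds: float  # time the reviewer spent on this item


def daily_metrics(drafts):
    """Group drafts by day and return {day: (override_rate, mean_review_seconds)}."""
    by_day = {}
    for d in drafts:
        by_day.setdefault(d.day, []).append(d)
    return {
        day: (
            sum(not d.approved_as_is for d in items) / len(items),
            mean(d.review_seconds for d in items),
        )
        for day, items in by_day.items()
    }


def is_flat(rates, tolerance=0.05):
    """True when override rates barely move across the period.

    The tolerance is an assumed threshold; tune it to the team's volume.
    """
    return max(rates) - min(rates) <= tolerance
```

Feeding Monday-to-Friday override rates into `is_flat` turns the "flat is the warning sign" rule into something a team can check on a Friday without a dashboard.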

Week three is paired. Half the team works with the agent. Half works without. We rotate at midweek. The point is to keep both modes in muscle memory. A team that only ever works with the agent will lose the ability to work without it, and the day the agent breaks (it will break) you will find out the hard way.

Week four is supervised solo. The agent ships directly. The team reviews a sampled slice. We pick the slice deliberately to include the edge cases the agent struggled with in weeks one and two. The teammate reviewing knows which slice is sampled, which means they bring more attention to it.
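One way to pick that slice, sketched under the assumption that each shipped item carries a `category` label and that the team has a list of categories the agent struggled with in weeks one and two. The function name, the dict shape, and the "edge cases first, random fill after" rule are all illustrative choices.

```python
import random


def pick_review_slice(shipped, flagged_categories, slice_size, rng=None):
    """Return up to slice_size items: flagged edge cases first, then a random fill."""
    rng = rng or random.Random()
    edge = [item for item in shipped if item["category"] in flagged_categories]
    rest = [item for item in shipped if item["category"] not in flagged_categories]

    # Take as many edge cases as fit in the slice.
    chosen = rng.sample(edge, min(len(edge), slice_size))

    # Fill whatever room is left with routine items so reviewers still see the normal flow.
    remaining = slice_size - len(chosen)
    chosen += rng.sample(rest, min(len(rest), remaining))
    return chosen
```

Keeping some routine items in the slice is a design choice: a slice made only of known edge cases would not reveal new failure modes.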

What we measure to know judgement is intact

A rollout that looks good on time-saved can hide a slow rot in the team. We track a handful of things on a weekly basis.

Override depth, not just override rate. A team that flips a yes to a no once a week and otherwise rubber-stamps is rotting faster than one with a steady 20% rewrite rate.
The time-on-task curve. If it falls every week, that can mean efficiency. It can also mean the team is reading less. We sample five items at random and ask the reviewer to explain their reasoning out loud. If they cannot, we slow down.
Recovery drills. Once a quarter the agent goes down for a day. On purpose. A team that grinds to a halt is a team that has leaned too hard on the agent.
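A rough sketch of the difference between override rate and override depth. It assumes one possible reading of "depth": how much of the agent's draft the reviewer actually changed, measured with the standard library's `difflib.SequenceMatcher`. The function names and the text-similarity proxy are assumptions for illustration. A real team might define depth differently, for example by the kind of change rather than its size.

```python
from difflib import SequenceMatcher


def override_depth(agent_draft, shipped_text):
    """0.0 means shipped unchanged; values near 1.0 mean a near-total rewrite."""
    return 1.0 - SequenceMatcher(None, agent_draft, shipped_text).ratio()


def summarize(pairs):
    """pairs: list of (agent_draft, shipped_text) tuples for one week."""
    depths = [override_depth(draft, shipped) for draft, shipped in pairs]
    touched = [d for d in depths if d > 0.0]
    return {
        # Share of items the reviewer changed at all.
        "override_rate": len(touched) / len(depths),
        # Average size of the change, counted only over items that were changed.
        "mean_depth_when_overridden": sum(touched) / len(touched) if touched else 0.0,
    }
```

Two teams can show the same override rate while one fixes commas and the other rewrites the substance. Reporting both numbers side by side makes that visible.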

The same logic shows up in Anthropic's research on how they contain Claude across products. Containment is not just a security posture. It is a training posture. You decide where the agent is allowed to act unsupervised, and that decision is also where your team's judgement gets either exercised or shelved.

Takeaway

The Berkeley story is not an argument against agents. It is an argument against shipping them without a plan for what they are eroding while they save you time.

The signal hidden in the Uber number

A different story from the same week is worth a look. Uber reportedly capped internal AI spend per engineer at roughly $1,500 a month, and most of the commentary read it as a market signal for AI tool pricing. There is a second signal in there that matters more for SME leaders. Uber is large enough to draw a line where the marginal AI dollar stops returning marginal judgement. Most SMEs are not yet asking that question. They are still in the "more agent, more better" phase.

At twelve people, the real cap is not the budget. It is the number of judgement-hours your team has available to oversee the work the agent ships. If your agent is processing four hundred emails a day and your team has six judgement-hours to spread across them, you are running a lottery, not a workflow.
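The arithmetic behind that lottery, as a small sketch. The figures of four hundred emails and six judgement-hours come from the example above. The three minutes needed for a careful review is an assumed number for illustration only, and the function names are hypothetical.

```python
def seconds_per_item(judgement_hours, items_per_day):
    """Review time available per item if attention is spread evenly."""
    return judgement_hours * 3600 / items_per_day


def reviewable_share(judgement_hours, items_per_day, seconds_needed_per_item):
    """Fraction of the day's items that can get a careful review."""
    capacity = judgement_hours * 3600 / seconds_needed_per_item
    return min(1.0, capacity / items_per_day)


# Six judgement-hours across four hundred emails.
print(seconds_per_item(6, 400))        # 54.0 seconds each
# Assuming a careful read takes 180 seconds, only part of the volume is covered.
print(reviewable_share(6, 400, 180))   # 0.3
```

Under those assumptions the team can either skim everything at under a minute each or read about three in ten properly. Either way, the cap is attention, not budget.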

The weekly review you hand off to the team

Here is the structure we leave behind for teams to run themselves once we are off the project. Forty minutes on a Friday. Pull a random ten from the week's agent output. For each one, the teammate who would have shipped it says out loud what they would have changed and why. Note the answers in a shared doc. Once a month, read the doc end-to-end and look for patterns of "I would not have caught that". Those are your training topics for the next month. Once a quarter, run the recovery drill. Agent off for a day. Watch what breaks.
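The review needs no tooling, but for teams that keep the notes in a structured form, the random pull and the monthly pattern read could be sketched like this. The note fields `topic` and `would_have_caught` are hypothetical, as are the function names. The shared doc described above works just as well.

```python
import random
from collections import Counter


def pull_audit_sample(week_outputs, k=10, seed=None):
    """Pick k items at random from the week's agent output (fewer if the week was quiet)."""
    rng = random.Random(seed)
    return rng.sample(week_outputs, min(k, len(week_outputs)))


def missed_topics(notes):
    """Count topics where the reviewer said they would not have caught the issue.

    notes: list of dicts with 'topic' (str) and 'would_have_caught' (bool).
    Returns (topic, count) pairs, most frequent first.
    """
    counts = Counter(n["topic"] for n in notes if not n["would_have_caught"])
    return counts.most_common()
```

The topics at the top of `missed_topics` are candidates for next month's training. Passing a `seed` is optional and only useful when the team wants to reproduce which ten items were pulled.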

That is the whole thing. No dashboard, no Notion template, no consultant. A team that runs this for a quarter knows more about its own judgement than most companies running agents today.

The smallest move you can make this week

If you already have an agent live and a team smaller than thirty, do this on Friday. Pull the last week of agent outputs. Pick ten at random. Sit with the teammate who would have shipped each one and ask what they would have changed. Note the answers. Read them on Monday. That hour is the cheapest training data you will collect this quarter and the most honest signal you have on whether the agent is sharpening your team or dulling it.

When we built the inbox-triage AI agents for a Dutch logistics SME earlier this year, the thing we ran into was exactly this. The team's override rate plateaued at week three and we nearly called the rollout done. Running the ten-item audit on a Friday is what surfaced the rot before it reached a customer. Those forty minutes a week are the part of the work we now refuse to skip.

FAQ

How do you tell if an agent is rotting your team's judgement?

Watch override depth, not just override rate. Track the time-on-task curve. Run a planned agent outage once a quarter. A flat override rate that holds for three weeks is the warning sign.

How long should agent onboarding take for a small team?

Four weeks. Week one read-only with the agent drafting only. Week two one-eye-on with staging approval. Week three paired, half the team with the agent and half without. Week four supervised solo with deliberate edge-case sampling.

What is the smallest habit that protects team judgement?

On Friday, pull ten random agent outputs. Sit with the teammate who would have shipped each one. Ask what they would have changed. Read the answers on Monday. Forty minutes a week, no tools required.

ai agents · strategy · operations · workflow · process automation
