Operations

AI pilots for small manufacturers: a week-one selection method

Thirty people. One factory floor. One week to pick the pilot that will or will not survive day ninety. Here is the scoring sheet we actually use.

Jacob Molkenboer· Founder · A Brand New Company· 7 Jun 2026· 6 min

Brass scale, folded blueprint, green index card, clipboard, pencil, red wax seal on ivory paper, side light.

It is Monday, 9:15. We are at a sheet-metal shop in Brabant. Twenty-two people, two CNC machines, one part-time bookkeeper who is six months from retirement. The owner has been reading about AI agents on LinkedIn for eighteen months. He wants one of them to start working by Friday. We have five working days, and one of us has to tell him which single task to pick.

We do not pick by demo quality. We pick by a scoring sheet that has lived through about forty of these weeks. It scores two things: hours saved per euro of build cost, and rollback ease at day ninety. The second number is the one most studios skip. It is also the one that decides whether the pilot survives the first quarter.

The list, before AI gets mentioned

The week starts with paper. We walk the floor with the operations lead and write down every desk task that happens more than once a week. Not "we could use AI for X." Just the task: who does it, how often, how long, what triggers it, what ends it.

A typical list from a small manufacturer looks like this:

Quote requests by email, four to eight per day, about fifteen minutes each.
Order confirmations to suppliers, three to five per day, five minutes.
Stock recount after a job, daily, twenty-five minutes.
Invoice chase calls, Fridays, ninety minutes.
Material certificate filing, per shipment, ten minutes.
Production schedule updates, twice daily, twenty minutes.
"Where is my order" replies, five to ten per day, four minutes.
Damage photo logging, per incident, twelve minutes.

Ten to fifteen entries is normal. The list comes from the people doing the work, not from the owner. The owner's list always over-indexes on whatever annoys him personally, which is rarely the highest-value pilot.

The two-axis score

Each task gets two numbers, on a one-to-five scale.

Hours saved per euro of build cost. We estimate the annual hours we believe we can remove, then divide by the all-in cost of the build (our work, plus model spend, plus the small tooling around it). A score of five means a ratio above 0.04 hours per euro, roughly forty hours saved per €1,000 spent. A score of one means below 0.005.

Rollback ease at day ninety. If we turn the agent off at 09:00 on day ninety, how messy is the cleanup. Five means "switch off, nobody notices, no orphaned data, no external party affected." One means "two systems now expect this to run, the team forgot the manual process, the supplier is emailing the agent's address as if it were a person."

Multiply the two numbers. Pick the top one. That is your pilot.

The math is not the point. The discipline is. The score forces the conversation onto two things that matter and away from the things that do not: demo polish, novelty, what the owner saw at a conference last month.

Takeaway

A pilot with great hours saved and terrible rollback ease is not a pilot. It is a marriage.

The rollback sketch comes before the build sketch

We draw the rollback before we draw the agent. Two columns on a single page.

Left column: every system, person, and external party that will touch the agent. Right column: what each of them does on day ninety if we switch it off at 09:00.

For an invoice-chase agent at a small fabricator, the right column reads:

Accounting software: untouched, never modified.
Bookkeeper: receives the unpaid-list export by email at 09:05, same workflow as before.
Customers: get nothing from the agent address; we forward the inbox to the bookkeeper for two weeks.
Bank: untouched.
Internal Slack: the agent channel goes silent, no further messages.

If any row reads "we will need to migrate data" or "the supplier will need to re-onboard," the pilot is wrong. The rollback should fit on one page. If it does not, the agent has grown into the company faster than it has earned the right.

The kill switch is the first commit

Before we write the prompt, we write the off switch. A single environment variable, default false. A status endpoint the owner can hit from his phone. A log line that prints exactly what would have been sent, when the switch is off.

# week one, commit one
export AGENT_ENABLED=false
node agent.js --dry-run
# logs: "would send: 'Hi, a friendly nudge on invoice 2026-0412...'"

This is not novel. It is config discipline applied to agents, the same way the twelve-factor app rules treat configuration as a first-class input. And it is what Martin Fowler calls a release toggle: a flag whose only job is to let you turn the new behaviour off without a redeploy.

The owner sees the dry run on Tuesday. He approves the tone. We flip the switch on Wednesday morning. We have a working pilot by Thursday lunch, and the rest of the week is for the rollback drill, not new features.

Why hours-per-euro beats hours-saved alone

The temptation in week one is to pick the task that saves the most hours, full stop. That picks the wrong pilot for a 22-person company. The right pilot is the one with the best ratio, because a sub-30-person manufacturer cannot afford a build that needs a second build to maintain it. A €4,000 agent that saves 250 hours a year beats a €40,000 agent that saves 800 hours a year, on day ninety. The smaller agent leaves headroom for the second pilot in month two, then a third in month four.

This is also why we steer clear of the heavy stuff in week one. RAG over the engineering drawings, voice agents at the front desk, anything that needs a six-figure integration. Those come in month four or month six, after three small wins have earned them the budget and the goodwill.

Day ninety, what we actually measure

Three numbers, taken from the same spreadsheet the owner already uses.

Hours actually removed, not estimated. Pulled from a two-minute weekly check-in with the person whose task it was.
Net cost: build, plus model spend, plus our maintenance retainer.
Incidents: anything the agent did that needed a human to clean up afterwards.

If the real hours-per-euro beats 60% of our estimate, the pilot stays. If it does not, we run the rollback. Most studios never run the rollback, which is how a small manufacturer ends up paying €1,200 a month for an agent that quietly stopped being useful in week six.

We have run the rollback twice across fourteen agents in production. Both times the company was relieved, not embarrassed. The conversation was the same on both calls: "we are glad we know now, instead of in month nine."

The five-minute version

When we built an invoice-chase agent for a fabricator in Helmond, the thing we ran into was that the agent's reply-to address had quietly become the supplier's primary contact within three weeks. The day-ninety rollback test caught it. We redirected the address back to the bookkeeper before the cutover, and the supplier never noticed the change. That sort of small repair work is most of what good process automation actually is.

One audit you can do this afternoon: take the next three tasks your team complained about by 11:00, and write the off-switch description for each. The one with the shortest description is your week-one pilot.

Key takeaway

A pilot you cannot turn off cleanly on day ninety is not a pilot. It is a marriage. Score rollback first, hours saved second.

FAQ

How long should week one actually take?

Five working days, one of them on the factory floor. If the scoring sheet takes more than a day to fill, the task list was not concrete enough. Go back and walk the floor again.

What if the highest-scoring pilot is not the one the owner wants?

Show both scores to the owner. The decision is his. Our job is to make the rollback cost visible before he signs off, not to overrule the choice.

Why score rollback ease at day ninety specifically?

Ninety days is long enough for the agent to embed into routines, short enough that the original manual process is still alive in muscle memory. Most regret about an agent shows up here.

What is a fair build budget for a first pilot?

For a sub-30-person manufacturer, between €3,000 and €8,000 all-in. Anything above that needs a second pilot's worth of evidence to justify, which by definition you do not have yet.

operationsai agentsprocess automationautomationstrategy

Building something?

Start a project