
Agent code audits: the checklist before a production rebuild

Three agency teams handed us repos where every check was green and the on-call load doubled within a month. Here is the audit we now run before quoting.

Jacob Molkenboer · Founder · A Brand New Company · 27 Oct 2024 · 9 min


A founder messaged us on a Thursday night with a repo URL and a single question: how much to rebuild this on a stack we trust. The CI badge was green. Every deploy in his Vercel dashboard had shipped. The repo had eight months of commits, almost all of them carrying the same agent-harness signature. His team's on-call rota had been quiet for six weeks. Then production started timing out under load. Then a webhook silently stopped retrying. Then a cron job double-fired and charged six hundred customers twice in one morning.

This was the third repo we had seen in eight months with the same shape: green on the way in, surprise on the way out. We stopped quoting rebuilds blind after the second one. Now we run an audit first, and only then write a number.

The green-CI, red-pager gap

An agent harness, Codex or Claude Code or Cursor or your own homebrew, is optimised to make the failing test pass. That is not the same target as "survive production". The agent writes the test it had in mind, then writes the code that makes it pass, and CI lights up green. Nothing in that loop forces the agent to ask: what if the third-party API rate-limits us, what if this cron job runs twice, what if the database is in read-only mode for ninety seconds during a failover.

You can wire those questions into the loop yourself. Most teams do not, because the agent never complained and the demos kept landing. The gap shows up six months later as an on-call burden that looks like negligence, but is really just a category error: a test suite that knows the happy path and almost nothing else.

What we open first

Before we read any application code we open four things in order: the instructions file the harness reads (agents.md, CLAUDE.md, .cursorrules, whatever it is called this month), the harness session logs if any were kept, the git log in plain text, and one production error log from the last seven days. Each of these tells us more than the source ever will.

The agents.md convention made this audit faster. Before that, every harness had its own dotfile and half of them were undocumented. A repo without any instructions file at all is not automatically a red flag, but it does mean the agent was running on training defaults the whole way through. That tends to correlate with whatever defaults fit the day the project started, not whatever defaults fit the system in production today.

The instructions file

The single biggest tell of harness quality is the instructions file. We check four things. When was it last touched, relative to the last twenty commits. Who wrote it, the agent or a human. Whether it contradicts itself across sections. And whether it gives the agent examples, not just rules.

A bad instructions file is four thousand words of "always" and "never" with no code reference. The agent has to guess what "use clean architecture" or "prefer pure functions" means in this codebase, and it guesses differently every session. A good instructions file is short, points at three real files in the repo, and says "do it like this". The harness gets the same answer twice in a row.

If the file has not been touched in eight weeks and the codebase has shipped two features since, it is stale, and the agent has been making it up. That is recoverable. What is not recoverable is the file that says one thing in paragraph two and the opposite in paragraph nine. We have seen this enough times to expect it before we open the file.
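The staleness check can be automated in a few lines. The following is a minimal sketch, not a definitive tool: it assumes the commit history has already been parsed (for example from `git log --name-only`) into newest-first objects, and the field names, the file-name list, and the 56-day threshold (the eight weeks mentioned above) are all illustrative assumptions.

```javascript
// Assumed file names for harness instructions; extend for your own harness.
const INSTRUCTION_FILES = ['agents.md', 'claude.md', '.cursorrules'];

function isInstructionsFile(path) {
  const name = path.split('/').pop().toLowerCase();
  return INSTRUCTION_FILES.includes(name);
}

// commits: newest-first array of { date: ISO string, files: string[] }.
// This shape is hypothetical; adapt it to however you parse the git log.
function instructionsStaleness(commits, now = new Date(), staleAfterDays = 56) {
  const index = commits.findIndex((c) => c.files.some(isInstructionsFile));
  if (index === -1) {
    // No instructions file was touched anywhere in this window.
    return { found: false, commitsSince: commits.length, daysSince: null, stale: true };
  }
  const msPerDay = 24 * 60 * 60 * 1000;
  const daysSince = Math.floor((now - new Date(commits[index].date)) / msPerDay);
  // Stale means: old, and other work has shipped since it was last edited.
  const stale = daysSince > staleAfterDays && index > 0;
  return { found: true, commitsSince: index, daysSince, stale };
}

// Illustrative data only.
const commits = [
  { date: '2024-10-20T10:00:00Z', files: ['src/billing/cron.js'] },
  { date: '2024-10-01T10:00:00Z', files: ['src/webhooks/retry.js'] },
  { date: '2024-08-01T10:00:00Z', files: ['AGENTS.md', 'src/index.js'] },
];

const report = instructionsStaleness(commits, new Date('2024-10-27T10:00:00Z'));
console.log(report);
```

The number is only a prompt for the human questions: who wrote the file, and does it contradict itself. A script cannot answer those.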

Tool permissions and shell scope

What can the agent actually run? Bash without restriction, or only a list of npm scripts? Can it touch the filesystem outside the repo root? Does it have network access during builds? Has it ever pushed to main without a human review in between?

This is where shadow infrastructure starts. An agent with unrestricted shell will, over a long enough horizon, install a tool nobody else knows about, write a script to /tmp that becomes load-bearing, or quietly add a cron entry. None of these show up in the repo. All of them show up in the pager. We grep for shell-out patterns in the harness logs and ask the team a single question: name every binary the agent has installed in the last three months. If they cannot, we are not quoting a rebuild yet.
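A rough sketch of that grep, assuming the harness log can be reduced to one shell command per line. Real harness logs vary widely, so both the log format and the pattern list here are hypothetical starting points rather than a complete detector.

```javascript
// Illustrative patterns only; they will miss things and need tuning per team.
const SHELL_OUT_PATTERNS = [
  {
    label: 'global or system install',
    regex: /\b(npm (install|i) (-g|--global)|pip install|brew install|apt(-get)? install)\b/,
  },
  { label: 'piped remote script', regex: /\b(curl|wget)\b.*\|\s*(ba)?sh\b/ },
  { label: 'cron change', regex: /\bcrontab\b/ },
  { label: 'write outside repo', regex: /(^|[\s>])\/tmp\// },
];

// lines: array of strings, one logged command per line (assumed format).
function scanHarnessLog(lines) {
  const findings = [];
  lines.forEach((text, i) => {
    for (const { label, regex } of SHELL_OUT_PATTERNS) {
      if (regex.test(text)) {
        findings.push({ lineNumber: i + 1, label, text });
      }
    }
  });
  return findings;
}

// Made-up log excerpt for demonstration.
const log = [
  '$ npm run test',
  '$ npm install -g some-cli',
  '$ curl -fsSL https://example.com/install.sh | sh',
  '$ node scripts/report.js > /tmp/report-cache.json',
  '$ git status',
];

const findings = scanHarnessLog(log);
for (const f of findings) {
  console.log(`line ${f.lineNumber}: ${f.label}: ${f.text}`);
}
```

Each finding is a question for the team, not a verdict: is this binary or file known, and does anything in production depend on it?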

The test corpus

We do not look at coverage. Coverage is a percentage of lines exercised, not a percentage of failure modes covered, and agents are good at writing tests that hit lines without testing behaviour.

We look for three smells. Tests that mock the system under test, so the assertion is really "the mock returned what the mock was told to return". Tests that re-implement the source in the test setup, so the test exercises its own copy of the logic and still passes, for the wrong reason, after the source changes. Tests with no assertions, only a no-throw guarantee.

// The pattern we see most often: an assertion on the mock.
test('sends invoice email', async () => {
  const sendEmail = jest.fn().mockResolvedValue({ id: 'msg_1' })
  await sendInvoiceEmail(sendEmail, fakeInvoice)
  expect(sendEmail).toHaveBeenCalled()
  // The real questions stay untested: what if sendEmail throws?
  // What if it returns a 5xx? What if it times out mid-flight?
})

Each of these smells is normal in human-written code at low volumes. Each of them becomes the default at scale when an agent is told "add tests for this file" without further constraint.
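For contrast, here is a sketch of checks that exercise a failure path instead of asserting on the mock. It is written as plain Node without Jest so that it runs on its own. The retrying `sendInvoiceEmail` and the `flakyTransport` fake are hypothetical stand-ins, not the code from any audited repo, and the sketch covers thrown errors only; timeouts would need their own case.

```javascript
// Hypothetical implementation under test: retries a failed send up to
// maxAttempts times, then throws an error that wraps the last failure.
async function sendInvoiceEmail(sendEmail, invoice, { maxAttempts = 3 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    try {
      return await sendEmail({ to: invoice.email, invoiceId: invoice.id });
    } catch (error) {
      lastError = error;
    }
  }
  throw new Error(`invoice ${invoice.id}: email failed after ${maxAttempts} attempts`, {
    cause: lastError,
  });
}

// A fake transport that fails a set number of times before succeeding.
function flakyTransport(failures) {
  const calls = [];
  const send = async (message) => {
    calls.push(message);
    if (calls.length <= failures) throw new Error('503 from provider');
    return { id: `msg_${calls.length}` };
  };
  return { send, calls };
}

const fakeInvoice = { id: 'inv_1', email: 'customer@example.com' };

// Failure path 1: two transient errors, then success on the third attempt.
async function recoversFromTransientErrors() {
  const transport = flakyTransport(2);
  const result = await sendInvoiceEmail(transport.send, fakeInvoice);
  return { result, attempts: transport.calls.length };
}

// Failure path 2: the provider never recovers, so the caller must see an error.
async function givesUpAfterMaxAttempts() {
  const transport = flakyTransport(Infinity);
  try {
    await sendInvoiceEmail(transport.send, fakeInvoice);
    return { threw: false, attempts: transport.calls.length };
  } catch (error) {
    return { threw: true, attempts: transport.calls.length, message: error.message };
  }
}

recoversFromTransientErrors().then((outcome) => console.log('recovers:', outcome));
givesUpAfterMaxAttempts().then((outcome) => console.log('gives up:', outcome));
```

The point of the fake is that it has behaviour of its own, so the checks say something about how the code reacts to failure rather than confirming that a mock was called.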

Warning

Coverage above 90% with zero tests for retry, idempotency, or timeout handling is the single most common pattern in repos that pass CI and double the on-call load.

Git hygiene and the shape of commits

Agent commits have a shape. They tend to touch eight to fifteen files at once, with a one-line message and a thirty-line diff. Human-reviewed agent commits look different: smaller, longer messages, with revert commits when something does not work out.

We do a quick read of the last fifty commits: who authored them, how big each diff was, and how many times the chain "fix X" then "fix X again" then "actually fix X" appears. The fix-the-fix chain is the clearest signal that the agent was looping without a human telling it to stop. Three or more of those chains in fifty commits and we know what the next six months looked like.
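Counting the chains by hand works for fifty commits, but a rough heuristic can do the first pass. This sketch assumes the commit subjects have been exported as plain strings (for example with `git log --format=%s -n 50`). The normalisation rules and the minimum chain length of two are assumptions, and real commit messages will need a looser match.

```javascript
// Returns the thing being fixed, or null if the subject is not a "fix X" commit.
// Strips assumed prefixes ("actually", "really", "finally") and suffixes
// ("again", "for real", "properly") so repeated attempts share one target.
function fixTarget(subject) {
  const match = subject
    .toLowerCase()
    .match(/^(?:actually |really |finally )?fix(?:es|ed)?:? (.+?)(?: again| for real| properly)?$/);
  return match ? match[1].trim() : null;
}

// Groups fix commits by target and keeps targets fixed minLength times or more.
// Adjacency is ignored here, which is a simplification.
function countFixChains(subjects, minLength = 2) {
  const counts = new Map();
  for (const subject of subjects) {
    const target = fixTarget(subject);
    if (target !== null) counts.set(target, (counts.get(target) || 0) + 1);
  }
  return [...counts]
    .filter(([, count]) => count >= minLength)
    .map(([target, count]) => ({ target, count }));
}

// Made-up history, newest first.
const subjects = [
  'actually fix webhook retry',
  'fix webhook retry again',
  'add invoice export',
  'fix webhook retry',
  'fix cron double fire',
  'update deps',
  'fix cron double fire again',
  'fix typo in readme',
];

const chains = countFixChains(subjects);
console.log(chains);
```

Against the threshold above, three or more entries in the result for a fifty-commit window would be the signal to look closer.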

Observability hooks

We open one production log from the last seven days. We are not looking for the error. We are looking at the shape of the logging itself. Is anything caught and rethrown with context, or is everything either swallowed or crashed-on? Are there metrics for request latency, queue depth, retry count? Are there traces that cross the boundary between the agent's code and the third-party calls it makes?

Most agent-built systems we have audited do not have any of this. The agent was never asked to add observability, because no test failed without it, and the human reviewer did not think to ask either. The twelve-factor guide on treating logs as event streams still holds up here, twelve years on. If the answer to "how do you know the system is degraded versus broken" is silence, we add an observability week to the quote before we touch anything else.
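As an illustration of "caught and rethrown with context", here is a minimal sketch. The helper names and the context fields are hypothetical, and the wrapper is synchronous for brevity; an async version would await the wrapped call. It relies on the standard `cause` option of the `Error` constructor.

```javascript
// Wraps a call so that a failure is rethrown with context attached,
// instead of being swallowed or crashing with a bare upstream error.
function withContext(operation, context, fn) {
  try {
    return fn();
  } catch (cause) {
    const error = new Error(`${operation} failed`, { cause });
    error.context = context;
    throw error;
  }
}

// Serialises an error as one JSON object per line, suitable for stdout.
function toLogEvent(error, now = new Date()) {
  return JSON.stringify({
    level: 'error',
    time: now.toISOString(),
    message: error.message,
    cause: error.cause ? error.cause.message : undefined,
    ...error.context,
  });
}

let line;
try {
  withContext('charge customer', { customerId: 'cus_42', attempt: 2 }, () => {
    // Stand-in for a third-party call that fails.
    throw new Error('upstream returned 429');
  });
} catch (error) {
  line = toLogEvent(error, new Date('2024-10-27T03:00:00Z'));
  console.log(line);
}
```

A log line shaped like this answers the audit question directly: what was the system doing, for whom, and on which attempt, when the call failed.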

The runbook nobody wrote

Is there a runbook? Has anyone used it? If the database goes into failover, what happens? If the upstream API returns 429 for an hour, what happens? If the cron worker dies mid-job, what happens to the job that was running? The Google SRE book is older than most agent harnesses, and its chapter on managing incidents still describes the gap exactly: the team that wrote the system was never the team paged when it broke.

The runbook absence is rarely the agent's fault. The agent was never on call. It does not know what your team needs at 3am because it has never been there. This is also the section that is fastest to fix, because once you list the scenarios, the agent is actually quite good at drafting the response steps.

Scoring and what we do with it

We score each of the seven sections above out of five and add them up. Below fifteen out of thirty-five, we will not quote a rebuild without a discovery week first, because the unknowns are too large. Between fifteen and twenty-five, we quote a refactor with a clear migration path, usually less work than the client expected. Above twenty-five, we tell the team they have something worth keeping and we quote a stabilisation engagement instead. About one repo in five lands above twenty-five. The other four were the reason the founder messaged us in the first place.
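The thresholds above reduce to a small function. This is a sketch of the arithmetic only: it takes seven section scores without naming them, and it assumes whole-number scores from 0 to 5, which the text does not specify.

```javascript
const SECTION_COUNT = 7;
const MAX_PER_SECTION = 5;

// scores: array of seven integers, each 0..5 (the 0 lower bound is an assumption).
function recommend(scores) {
  if (scores.length !== SECTION_COUNT) {
    throw new RangeError(`expected ${SECTION_COUNT} scores, got ${scores.length}`);
  }
  for (const score of scores) {
    if (!Number.isInteger(score) || score < 0 || score > MAX_PER_SECTION) {
      throw new RangeError(`score out of range: ${score}`);
    }
  }
  const total = scores.reduce((sum, score) => sum + score, 0);
  if (total < 15) {
    return { total, recommendation: 'discovery week before any rebuild quote' };
  }
  if (total <= 25) {
    return { total, recommendation: 'refactor with a migration path' };
  }
  return { total, recommendation: 'stabilisation engagement' };
}

// Illustrative score sheets.
const weak = recommend([2, 1, 2, 3, 1, 2, 2]);
const middling = recommend([3, 3, 2, 4, 3, 3, 2]);
const strong = recommend([4, 4, 4, 4, 4, 3, 3]);
console.log(weak, middling, strong);
```

Treating exactly fifteen and exactly twenty-five as part of the middle band follows the wording "between fifteen and twenty-five".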

The five-minute version

If you cannot run the full audit, run three steps. Read the instructions file aloud and ask whether it would be useful to a new hire who joined yesterday. Read the last fifty commits and count the fix-the-fix chains. Open one production log from the last week and ask whether you can tell, from the log alone, what the system was doing when something went wrong. If any of those three is empty, missing, or unreadable, you already have your answer about the rebuild.

When we audited the agent stack for a Rotterdam logistics client this spring, the surprise was in the test corpus: 94% coverage, zero retry tests, and a cron job that had been silently double-firing for six weeks. We did not rebuild. We added retry tests, a deduplication key on the cron, and a thin observability layer, and they closed two recurring incidents in the first week. If that pattern sounds familiar, the same checklist is what we run on every AI agent engagement we take on.
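A deduplication key of the kind mentioned above can be sketched as follows. This is not the client's implementation: the names are hypothetical, and the in-memory `Set` stands in for what would need to be a shared store with an atomic insert-if-absent operation when more than one worker can fire.

```javascript
// Returns a runOnce function that executes a job at most once per key.
function createDeduplicator(store = new Set()) {
  return function runOnce(jobName, scheduledFor, job) {
    // Key on the scheduled slot, not the wall-clock start time, so two
    // firings for the same slot compute the same key.
    const key = `${jobName}:${scheduledFor}`;
    if (store.has(key)) return { ran: false, key };
    store.add(key);
    job();
    return { ran: true, key };
  };
}

const runOnce = createDeduplicator();
const charged = [];
const chargeCustomers = () => charged.push('batch-2024-10-27');

// The same slot fires twice, as in a double-firing cron.
const first = runOnce('monthly-billing', '2024-10-27T06:00:00Z', chargeCustomers);
const second = runOnce('monthly-billing', '2024-10-27T06:00:00Z', chargeCustomers);
console.log(first.ran, second.ran, charged.length); // true false 1
```

Recording the key before the job runs means a job that crashes midway will not be retried automatically. Whether that is the right trade-off depends on whether a double charge or a missed run is the worse failure.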

The smallest thing you can do today, before any audit: open the instructions file your harness reads, and check the date on the last commit that touched it. If it is older than the last feature you shipped, your agent has been guessing for that long.

Key takeaway

Coverage above 90% with zero tests for retry, idempotency, or timeout is the single most common reason agent-built code passes CI and doubles the on-call load.

FAQ

What counts as an agent harness?

The runtime that wraps a coding model and decides which tools it can call: shell, file system, git, network. Codex, Claude Code, Cursor and Aider are all harnesses with different defaults.

How long does a full audit take?

A first read takes a senior engineer four to six hours for a mid-sized repo. Scoring and the write-up add another half-day. We bill it flat so the result is not pressured.

When should we rebuild versus refactor?

Below 15/35 on the checklist, rebuild, but only after a discovery week to size the unknowns. Between 15 and 25, refactor with a migration path. Above 25, stabilise and add observability. The test corpus and the runbook drive the answer more than the language choice.

Does agents.md actually change outcomes?

Yes, when the file points at real examples in the repo and is updated alongside features. No, when it is generic rules with no code references, which is how most of them start.

