Code-review agent: posting only where it's never wrong

A code-review bot is only useful if the team trusts it. We shipped one into a 38-person Dutch SaaS company, where it comments only on the four file types where it has never been wrong.

Jacob Molkenboer · Founder · A Brand New Company · 6 Jun 2026 · 8 min

It was a Friday afternoon at a 38-person SaaS in Utrecht. The pull-request queue was eight deep. A senior backend engineer had just right-clicked the new code-review bot's comment, marked it as outdated, and added a personal mute rule to his IDE. It was the third time that week. The bot had flagged a Tailwind utility class as "deprecated." It wasn't.

The team had been live with the bot for nine days. In that window it had posted 412 comments across 47 pull requests. Nineteen of those comments had been actioned. The other 393 had been dismissed, marked resolved without a code change, or quietly ignored. The platform team was about to disable the integration when we asked them to give us two more weeks and the right to delete most of what the bot was allowed to say.

The Friday afternoon mute

The bot itself was not the problem. The model was strong enough. The prompt was carefully written. The integration with GitHub was clean. What was killing it was the same thing that kills most code-review agents: it had opinions about everything.

This is the failure mode the Open Code Review CLI, which hit Hacker News this week, has to fight just like everyone else. An LLM that reads diffs will always have something to say. The interesting question is when it should keep its mouth shut.

The team's complaint was simple. When a junior dev opens a pull request and sees twelve bot comments, three of which are wrong, the signal-to-noise ratio collapses. They start scrolling past everything. The next time the bot catches a real bug, no one reads the comment.

Why most review bots end up muted

We have shipped review agents into four codebases now. The pattern is consistent. The first week is exciting. The second week is grim. By the third week, someone on the team has written a Slack thread titled "can we kill the bot."

The root cause is rarely the model. It is the surface area. A code-review bot that is configured to review every file in every language ends up making the same trade-off a tired junior reviewer does: it talks about style because style is easy. It talks about naming. It talks about whether a comment should be in past or present tense. It is right about all of these things some of the time, and wrong about them the rest of the time, and the team has no way of knowing which is which until they argue with it.

The pattern that kept working in our other projects was narrowing. Not a better prompt, not a smarter model. Just a much smaller set of paths the bot was allowed to comment on.

The confidence ledger

Before we cut anything, we needed evidence. We added a Postgres table called review_outcomes that logged every comment the bot posted, the file path, the rule that triggered it, and what the human reviewer did next.

CREATE TABLE review_outcomes (
  id            bigserial PRIMARY KEY,
  pr_url        text NOT NULL,
  file_path     text NOT NULL,
  rule_id       text NOT NULL,
  posted_at     timestamptz NOT NULL DEFAULT now(),
  resolution    text CHECK (resolution IN ('accepted','rejected','ignored')),
  reviewer_note text
);

Accepted meant a code change in a follow-up commit that referenced the bot's comment. Rejected meant a human explicitly disagreed in a reply. Ignored meant the PR was merged with the comment still open. The bot was set to comment on a wide allowlist for ten working days while the ledger filled up. A small webhook listened to GitHub's review-comment events and wrote into the table.
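The three outcome definitions can be written down as a small decision function. This is a minimal sketch, not the webhook the team ran: the ThreadState shape and the classifyResolution name are hypothetical, and the precedence (an explicit disagreement wins over a follow-up commit) is our assumption for the illustration.

```typescript
// Hypothetical summary of what happened to one bot comment.
type ThreadState = {
  followUpCommitReferencesComment: boolean; // a later commit cites the bot's comment
  humanDisagreedInReply: boolean;           // a reviewer pushed back in the thread
  prMerged: boolean;                        // the PR has been merged
};

type LedgerResolution = "accepted" | "rejected" | "ignored";

// Returns null while the outcome is still open (PR not merged, nobody reacted),
// which matches the nullable resolution column in review_outcomes.
function classifyResolution(t: ThreadState): LedgerResolution | null {
  // Assumption: an explicit disagreement outranks everything else.
  if (t.humanDisagreedInReply) return "rejected";
  if (t.followUpCommitReferencesComment) return "accepted";
  // Merged with the comment still open and no reaction.
  if (t.prMerged) return "ignored";
  return null;
}
```

Keeping the classification in one pure function makes it easy to re-run over old threads if the definitions change later.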

At day ten the table had 1,184 rows. We grouped by file extension, then by directory pattern, then by rule. We were looking for every (path-pattern × rule) pair where the bot had posted at least eight comments and been rejected zero times.
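The cull itself is a group-and-filter over the ledger. A sketch of that step, assuming each row has already been tagged with a path pattern upstream; the LedgerRow shape, the trustedPairs name, and the rule IDs used in any example data are hypothetical, while the threshold of eight comments and zero rejections comes from the text above.

```typescript
type LedgerRow = {
  pathPattern: string; // e.g. "db/migrations/*.sql", assumed to be assigned upstream
  ruleId: string;
  resolution: "accepted" | "rejected" | "ignored" | null;
};

type CohortStats = {
  pathPattern: string;
  ruleId: string;
  posted: number;
  rejected: number;
};

const MIN_COMMENTS = 8; // threshold from the article

// Keep only (path-pattern x rule) pairs with enough evidence and zero rejections.
function trustedPairs(rows: LedgerRow[]): CohortStats[] {
  const byPair = new Map<string, CohortStats>();
  for (const row of rows) {
    const key = JSON.stringify([row.pathPattern, row.ruleId]);
    let stats = byPair.get(key);
    if (!stats) {
      stats = { pathPattern: row.pathPattern, ruleId: row.ruleId, posted: 0, rejected: 0 };
      byPair.set(key, stats);
    }
    stats.posted += 1;
    if (row.resolution === "rejected") stats.rejected += 1;
  }
  return Array.from(byPair.values()).filter(
    (s) => s.posted >= MIN_COMMENTS && s.rejected === 0,
  );
}
```

Note that a pair with seven flawless comments is dropped just like a pair with one rejection. Too little evidence counts the same as bad evidence.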

The four file types that earned a voice

Four cohorts survived the cull. The bot had posted at least eight times against files matching each pattern, and not a single comment had been argued with by a human reviewer.

Raw SQL migrations under db/migrations/*.sql. The bot caught missing IF NOT EXISTS clauses, indexes added without CONCURRENTLY on a table the team knew was large, and one migration that dropped a column still referenced in a queue handler.
Prisma schema changes in prisma/schema.prisma. Mostly: relations declared on one side but missing the inverse, and unique constraints added to columns where the data already had duplicates.
Next.js route handlers matching app/api/**/route.ts. The bot reliably caught missing input validation on the request body, and missing auth checks on routes that touched the user table.
Background job definitions in app/jobs/*.ts. Idempotency keys not set on payment-adjacent jobs, retry budgets missing, and one case of a job that called an external API without a timeout.

Notice what these have in common. They are narrow. They map to a small number of well-defined failure modes. They are the places where a careful senior engineer would slow down and read twice. They are not the places where taste matters.

What we cut

Everything else got cut. React components. Tailwind classes. Test files. TypeScript utility modules. Helper files in lib/. Markdown. The bot was no longer allowed to look at any of them.

The router is the entire policy. It is about a dozen lines of TypeScript and it is the most important part of the integration.

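// review-router.ts
// Allowlist of repo-relative paths the bot may review. Note that `.+` also
// matches nested directories, so each entry covers subfolders as well.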
const TRUSTED_PATHS: RegExp[] = [
  /^db\/migrations\/.+\.sql$/,
  /^prisma\/schema\.prisma$/,
  /^app\/api\/.+\/route\.ts$/,
  /^app\/jobs\/.+\.ts$/,
];

export function shouldReview(file: string): boolean {
  return TRUSTED_PATHS.some((re) => re.test(file));
}

For every file in a pull-request diff, the runner calls shouldReview. If the answer is false, the bot does not even read the file. No tokens spent, no comment posted, no chance of being wrong.
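To make the "does not even read the file" property concrete, here is a sketch of the runner step. The router is copied from review-router.ts so the block runs on its own; the reviewDiff function and the loadFile callback are hypothetical stand-ins for whatever the real runner uses to fetch file contents.

```typescript
// Copied from review-router.ts so this sketch is self-contained.
const TRUSTED_PATHS: RegExp[] = [
  /^db\/migrations\/.+\.sql$/,
  /^prisma\/schema\.prisma$/,
  /^app\/api\/.+\/route\.ts$/,
  /^app\/jobs\/.+\.ts$/,
];

function shouldReview(file: string): boolean {
  return TRUSTED_PATHS.some((re) => re.test(file));
}

// Hypothetical runner step: loadFile is only ever called for trusted paths.
function reviewDiff(
  changedFiles: string[],
  loadFile: (path: string) => string,
): { reviewed: string[]; skipped: string[] } {
  const reviewed: string[] = [];
  const skipped: string[] = [];
  for (const file of changedFiles) {
    if (!shouldReview(file)) {
      skipped.push(file); // never loaded, never sent to the model
      continue;
    }
    loadFile(file); // a real runner would pass this content to the model
    reviewed.push(file);
  }
  return { reviewed, skipped };
}
```

For a diff touching db/migrations/001_add_index.sql, app/api/users/route.ts, and components/Button.tsx, only the first two are loaded. The component is skipped before any content is fetched.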

Takeaway

A review agent earns trust by refusing to speak. The hard work is not the prompt. It is deciding which files the bot has earned the right to touch.

The rule library

Each trusted cohort has a short list of rules. For the route-handler cohort, three rules cover most of the value:

Every handler that reads from request.json() must validate the body before using it. Unvalidated input is a common root cause of the injection flaws listed in the OWASP Top 10, and it shows up in our ledger more than any other route-handler issue.
Every handler under /api/admin/** must call the admin-auth helper as the first statement of the function.
Every handler that writes to the user, payment, or subscription tables must run inside a transaction.

The rules are tied to the cohort, not to the model. The prompt the LLM sees is short and ends with the three rules above and three or four examples of past comments that were accepted. The model is not asked to reason about the whole codebase. It is asked to check three things in one file.
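A sketch of how a cohort and its prompt could be assembled. The rule text is taken from the list above; everything else is an assumption for illustration: the Cohort shape, the buildPrompt name, the instruction wording, the NO_FINDINGS sentinel, and the placeholder examples, which in practice would come from accepted rows in the ledger.

```typescript
type Cohort = {
  name: string;
  rules: string[];
  acceptedExamples: string[]; // past comments that reviewers accepted
};

const ROUTE_HANDLER_COHORT: Cohort = {
  name: "Next.js route handlers",
  rules: [
    "Every handler that reads from request.json() must validate the body before using it.",
    "Every handler under /api/admin/** must call the admin-auth helper as the first statement of the function.",
    "Every handler that writes to the user, payment, or subscription tables must run inside a transaction.",
  ],
  // Placeholders only; real entries would be copied from the ledger.
  acceptedExamples: [
    "<accepted comment 1>",
    "<accepted comment 2>",
    "<accepted comment 3>",
  ],
};

// Builds a short prompt that ends with the cohort's rules and accepted examples.
function buildPrompt(cohort: Cohort, filePath: string, fileContent: string): string {
  const rules = cohort.rules.map((rule, i) => `${i + 1}. ${rule}`).join("\n");
  const examples = cohort.acceptedExamples.map((ex) => `- ${ex}`).join("\n");
  return [
    `You are reviewing one file: ${filePath}`,
    "Check only the rules listed below. If none is violated, reply NO_FINDINGS.",
    "",
    "File content:",
    fileContent,
    "",
    "Rules:",
    rules,
    "",
    "Examples of past comments that reviewers accepted:",
    examples,
  ].join("\n");
}
```

Because the rules live in data rather than in the prompt template, retiring a rule is a one-line change to the cohort and does not touch the model or the runner.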

Shipping it as a quiet GitHub check

We posted the bot's output through GitHub's Checks API rather than as inline review comments. A check appears as a status indicator on the PR. If the bot has nothing to say, the check shows green and disappears into the bottom of the PR conversation. If it has something to say, the check finishes with a neutral conclusion and carries the suggestion as an annotation.

This changed the social dynamic. An inline comment from the bot is a thing a developer has to dismiss, and dismissing things feels like work. A neutral check is a thing they can ignore until they care. The cost of being wrong dropped to nearly zero, which let us be slightly more aggressive about what we did flag.

We also locked the bot to one annotation per file per PR. If it spotted three things wrong with a migration, it bundled them into a single annotation. The team had told us that a wall of bot-text on one file was indistinguishable from spam.
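The one-per-file bundling and the quiet check can be sketched as a payload builder. The field names follow the request body GitHub's check-run endpoint accepts (name, head_sha, status, conclusion, output with annotations); the BotFinding shape, the buildCheckRun name, the check name, and the choice of the "notice" annotation level are assumptions for this example. The block only builds the payload and makes no network call.

```typescript
type BotFinding = { path: string; line: number; ruleId: string; message: string };

type CheckAnnotation = {
  path: string;
  start_line: number;
  end_line: number;
  annotation_level: "notice" | "warning" | "failure";
  message: string;
};

type CheckRunPayload = {
  name: string;
  head_sha: string;
  status: "completed";
  conclusion: "success" | "neutral";
  output: { title: string; summary: string; annotations: CheckAnnotation[] };
};

function buildCheckRun(headSha: string, findings: BotFinding[]): CheckRunPayload {
  // Group findings by file so each file gets exactly one annotation.
  const byFile = new Map<string, BotFinding[]>();
  for (const finding of findings) {
    const group = byFile.get(finding.path) || [];
    group.push(finding);
    byFile.set(finding.path, group);
  }

  const annotations: CheckAnnotation[] = [];
  byFile.forEach((group, path) => {
    const lines = group.map((f) => f.line);
    annotations.push({
      path,
      start_line: Math.min(...lines),
      end_line: Math.max(...lines),
      annotation_level: "notice", // assumption: lowest level, never blocks a merge
      message: group.map((f) => `[${f.ruleId}] line ${f.line}: ${f.message}`).join("\n"),
    });
  });

  const quiet = annotations.length === 0;
  return {
    name: "narrow-review-bot", // hypothetical check name
    head_sha: headSha,
    status: "completed",
    conclusion: quiet ? "success" : "neutral", // green when silent, neutral otherwise
    output: {
      title: quiet ? "Nothing to flag" : `Suggestions on ${annotations.length} file(s)`,
      summary: quiet ? "No trusted rule was triggered." : "One bundled annotation per file.",
      annotations,
    },
  };
}
```

Three findings on the same migration come out as a single annotation spanning the lowest to the highest flagged line, with one message line per finding.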

Six weeks in

The narrow bot has now been live for six weeks. The ledger keeps filling up. The numbers from the last 28 days:

74 pull requests touched at least one file in the four trusted cohorts.
The bot posted on 41 of them. It stayed silent on 33.
31 of the 41 comments led to a code change in a follow-up commit.
9 were merged with the comment open, then revisited within a week.
1 was explicitly rejected, on a Prisma relation the team had a reason to keep one-sided.

That single rejection mattered. The rule that produced it has been retired pending more evidence. The cohort is now "Prisma schema, minus the inverse-relation check." Confidence is not static. The ledger keeps running, and the policy moves with it.
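Moving the policy with the ledger can be as simple as dropping any active pair that picked up a rejection in the review window. A minimal sketch under that assumption; the PolicyPair shape, the retireRejected name, and the rule IDs in the usage note are hypothetical.

```typescript
type PolicyPair = { pathPattern: string; ruleId: string };

type RecentOutcome = PolicyPair & {
  resolution: "accepted" | "rejected" | "ignored" | null;
};

// Splits the active policy into pairs that stay live and pairs that are
// retired because at least one recent comment was explicitly rejected.
function retireRejected(
  active: PolicyPair[],
  recent: RecentOutcome[],
): { kept: PolicyPair[]; retired: PolicyPair[] } {
  const keyOf = (p: PolicyPair): string => JSON.stringify([p.pathPattern, p.ruleId]);
  const rejectedKeys = new Set<string>();
  for (const row of recent) {
    if (row.resolution === "rejected") rejectedKeys.add(keyOf(row));
  }
  return {
    kept: active.filter((p) => !rejectedKeys.has(keyOf(p))),
    retired: active.filter((p) => rejectedKeys.has(keyOf(p))),
  };
}
```

With one rejection logged against a hypothetical "inverse-relation" rule on prisma/schema.prisma, that pair moves to retired while the other Prisma rule stays in kept, so the cohort shrinks without disappearing.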

The most useful number, though, is not in the ledger. It is in the platform team's Slack. No one has asked us to disable the bot in six weeks. The senior engineer who muted it that Friday afternoon has since unmuted it. He said the bot was finally "behaving like a colleague who only talks when they know something."

What this is not

This is not a vulnerability scanner. It will not find a SQL injection in a route handler that does not match the pattern. It will not catch a memory leak in a React component. Anthropic's recently open-sourced framework for AI-powered vulnerability discovery is a different shape of tool, and the team uses it for a different job.

It is also not a replacement for human code review. It is a colleague who only speaks up in four very specific situations. The team's pull-request etiquette has not changed. Two human approvals are still required. The bot is just one more pair of eyes on the four places where missing one thing tends to cost a weekend.

One thing you can do this week

If you already have a code-review bot installed and the team mutes it, run a one-week experiment. Log every comment it posts, the file it touched, and whether the comment led to a code change. At the end of the week, throw away every (file-pattern × rule) pair where the bot has been wrong even once. Keep what survives. Ship that.

When we built this for the Utrecht SaaS, the hardest part was not the model and not the GitHub integration. It was getting the team to agree, in a single afternoon, on the four paths the bot had earned the right to talk about. The same narrow-scope discipline keeps showing up in every AI agent we ship: the system earns trust by refusing to speak unless it knows. Staying mute is now a feature of the bot, not a failure.

Key takeaway

Narrow your code-review bot to the file types where it has never been wrong, and the team stops muting it.

FAQ

Why only four file types?

Because those are the only paths where the bot's ledger of past comments showed zero rejections. Trust is earned per file pattern, not granted across the whole repo.

Does the narrow bot replace human code review?

No. Two human approvals are still required on every PR. The bot is an extra pair of eyes on four file types where missing one thing tends to cost a weekend.

How do you keep the bot honest over time?

The outcomes ledger keeps running. If a rule starts getting rejected by humans, it is retired or scoped down until the ledger shows zero rejections again.

Can this approach work without an LLM?

Largely, yes. The narrowing is the trick. Most of the value comes from confining review to high-stakes file types. The LLM helps with phrasing and edge cases inside that scope.
