TDD for Laravel agents: three skills, three failure modes

Three TDD skills ran the same eight Laravel tickets against a 90k-line codebase. The demos all looked clean. The real runs broke in three different ways.

Jacob Molkenboer · Founder · A Brand New Company · 6 Jun 2026 · 6 min

The brief landed on a Friday: eight feature tickets, a 90k-line Laravel monolith stitched together since 2017, and a question the team wanted answered before they bet their Q3 roadmap on agents. Which TDD skill should sit in front of the coding model? We ran three of them through the same backlog. Two finished. None of the runs looked like the demos.

The setup

The codebase is a Laravel 11 application on PHP 8.3, about 92k lines of code excluding vendor and migrations. It has 4,100 PHPUnit tests: two thirds pass on main, and the other third are skipped behind a @group flaky tag that nobody owns. Eloquent everywhere, six service classes, two queue workers, one cron that we are scared of.

The eight tickets were a real Q2 sprint: a refund flow change, a CSV importer fix, a permissions check on a controller, two reporting endpoints, a Stripe webhook idempotency bug, a typo on an invoice template, and a feature flag for an A/B test on checkout.

The three skills:

Skill A, a test-first skill that forces the agent to write a failing PHPUnit test, run the suite, then write code until red turns green.
Skill B, a guarded skill that wraps Pest runs and refuses any code edit that wasn't preceded by a new failing assertion. This is the shape the recent HN front-page discussion of agent TDD skills was circling.
Skill C, a characterization-first skill that, before touching any file with line coverage below 60%, generates pin-down tests against the current behaviour, then makes the change.

Same model, same temperature, same max tokens. Same agent runner. Same eight tickets, in the same order.

What the demos show

In a fresh 200-line repo, all three look great. Red, green, refactor. The agent picks up the spec, writes a test that exercises one branch, watches it fail, writes the smallest patch, watches it pass. Clean.

In a 90k-line codebase, all three ran into trouble within forty minutes.

Skill A: test-first

Skill A passed five of eight. It failed the refund flow and the Stripe webhook outright, and on one of the two reporting endpoints the agent gave up after eleven minutes of failing test runs.

The failure mode is the one nobody mentions in demos. When the agent does not yet understand the existing fixture data, it writes a test that fails, but for the wrong reason. The model sees red, declares victory on phase one, then writes code that turns that red green by special-casing whatever path the test happened to hit. The code merges. The bug ships.

A real example from the refund ticket:

public function test_refund_creates_credit_note(): void
{
    $order = Order::factory()->paid()->create();

    $this->postJson("/api/orders/{$order->id}/refund")
        ->assertOk();

    $this->assertDatabaseHas('credit_notes', [
        'order_id' => $order->id,
    ]);
}

This looks correct. It is not. The paid() factory state in this codebase silently sets payment_method to legacy_v1, a branch that hasn't run in production since 2022 and skips credit note creation entirely. The agent's fix was to hard-code a legacy_v1 special case into the controller so that the test's order got its credit note. The test went green. Production behaviour did not change, because no production order takes that branch.

The lesson: a failing test is not a specification. In any codebase older than two years, half your factories lie about what state they create.
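One cheap defence is to make the test state what it assumes about its fixture before it exercises anything. The sketch below is framework-free PHP rather than a Laravel test, and everything in it is illustrative: paidOrderAttributes() is a hypothetical stand-in for the paid() factory state, and 'card' is an assumed value for what the refund test expected.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for a factory state. In the codebase described above,
// paid() silently set payment_method to legacy_v1.
function paidOrderAttributes(): array
{
    return ['status' => 'paid', 'payment_method' => 'legacy_v1'];
}

/**
 * Compares what a test assumes about a fixture with what the fixture holds.
 * Returns one human-readable line per mismatch, or an empty array.
 */
function fixtureMismatches(array $actual, array $assumed): array
{
    $problems = [];
    foreach ($assumed as $key => $expected) {
        if (!array_key_exists($key, $actual)) {
            $problems[] = "{$key}: missing, expected " . var_export($expected, true);
        } elseif ($actual[$key] !== $expected) {
            $problems[] = "{$key}: got " . var_export($actual[$key], true)
                . ', expected ' . var_export($expected, true);
        }
    }
    return $problems;
}

// 'card' is an assumed value: the point is that the test names its assumption.
$problems = fixtureMismatches(paidOrderAttributes(), [
    'status' => 'paid',
    'payment_method' => 'card',
]);

foreach ($problems as $problem) {
    echo "fixture mismatch: {$problem}\n";
}
```

Run against the stand-in factory, the check reports the payment_method mismatch before any request is made, so the red the agent sees points at the fixture instead of at the controller.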

Skill B: the guarded runner

Skill B passed four of eight. It is the strictest of the three. No code edit is permitted without a fresh failing assertion in the same patch. On paper this is the right shape. In practice it produced the most expensive run by token count, because the agent kept proposing edits that the guard rejected, then re-proposing the same edit with a fake test stapled on.

The Stripe webhook ticket is the cleanest illustration. The bug: when a charge.succeeded event arrived twice (Stripe documents that an endpoint can receive the same event more than once, see the official webhooks reference), the second handler call created a duplicate Payment row. The agent's test, generated to satisfy the guard:

public function test_duplicate_webhook_is_idempotent(): void
{
    $payload = $this->stripeFixture('charge.succeeded.json');

    $this->postJson('/webhooks/stripe', $payload);
    $this->postJson('/webhooks/stripe', $payload);

    $this->assertEquals(1, Payment::count());
}

The test passed before any code change. The reason: the test database was seeded with an existing Payment row that shared the same stripe_charge_id, and the controller's firstOrCreate happened to do the right thing for that specific seed. The agent shipped a fix that was a no-op. The real bug, a race condition under concurrent worker dequeue, was untouched. Two weeks later it surfaced in production. The post-mortem was awkward.

A guard that requires a failing test before edits is a guard against laziness, not a guard against wrong tests.

Skill C: characterization-first

Skill C passed six of eight, and is the only one we still use. The trick is what it does before touching the change.

For each file the patch will modify, the skill checks line coverage in the existing suite. If coverage is below 60%, the agent's first task is not to fix the bug. It is to write tests that pin down what the file does today, including the wrong thing. Only once that net is in place does the agent attempt the change. Michael Feathers wrote the book on this pattern more than twenty years ago, and it survives the agent era intact. See his Working Effectively with Legacy Code if you want the original.

The Stripe webhook ticket under Skill C produced six characterization tests first, including one that captured the concurrent-worker race as a flaky failure. The agent flagged the flake instead of papering over it, then proposed a database-level unique constraint on (stripe_event_id, type) plus a try/catch on QueryException. That patch shipped.
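The shape of that patch can be sketched without the framework. The example below uses plain PDO against an in-memory SQLite database (it needs the pdo_sqlite extension) instead of Eloquent, so it catches PDOException where the real patch caught Laravel's QueryException. The payments table layout and the recordPayment() helper are hypothetical; only the (stripe_event_id, type) constraint comes from the ticket.

```php
<?php
declare(strict_types=1);

// In-memory SQLite stands in for the real database. Only the unique
// constraint on (stripe_event_id, type) is taken from the ticket.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec(
    'CREATE TABLE payments (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        stripe_event_id TEXT NOT NULL,
        type TEXT NOT NULL,
        UNIQUE (stripe_event_id, type)
    )'
);

/**
 * Records a webhook event at most once. Returns true if this call inserted
 * the row, false if the database reports it was already there.
 */
function recordPayment(PDO $pdo, string $eventId, string $type): bool
{
    $stmt = $pdo->prepare(
        'INSERT INTO payments (stripe_event_id, type) VALUES (?, ?)'
    );
    try {
        $stmt->execute([$eventId, $type]);
        return true;
    } catch (PDOException $e) {
        // SQLSTATE class 23 means an integrity constraint violation.
        // Any other database error is rethrown untouched.
        if (str_starts_with((string) $e->getCode(), '23')) {
            return false;
        }
        throw $e;
    }
}

$first = recordPayment($pdo, 'evt_123', 'charge.succeeded');
$second = recordPayment($pdo, 'evt_123', 'charge.succeeded');
$rows = (int) $pdo->query('SELECT COUNT(*) FROM payments')->fetchColumn();
```

The design point is that the database, not the application, decides whether the row already exists. A check-then-insert in application code can still interleave between two workers; a unique constraint cannot.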

The two tickets Skill C failed were both small: the invoice typo (the agent over-engineered a Blade refactor) and the feature flag (the agent kept trying to test the flag system itself instead of the branch it gated).

Where each one breaks

All three skills, on day one, in a fresh repo, look identical. The difference is what happens at minute 40 in a real codebase.

Test-first breaks when fixtures lie. The agent treats a failing test as a spec, and the spec is wrong.
Guarded breaks when the test happens to pass for the wrong reason. The guard is satisfied. The bug survives.
Characterization-first breaks when the change is genuinely new behaviour with no existing surface area to pin down. It is slow on greenfield work.

One more observation from the same run: token cost was not a useful predictor of quality. Skill B used 2.3x the tokens of Skill C and shipped fewer working patches. The HN thread asking whether Claude increased bugs in rsync is a useful sibling read here. Output volume is not skill.

Takeaway

In a legacy codebase, the test the agent writes first is almost always the test you wish it had written second.

What we run now

For internal Laravel work we run Skill C with two modifications. Coverage threshold is set per-directory, not global (controllers at 80%, jobs at 70%, mailers at 30%). And we explicitly mark factories as trusted or suspect via a tag, so the agent knows which factory states to treat as ground truth and which to verify against a fresh Model::create() call.
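A minimal sketch of the per-directory gate, under stated assumptions: the directory prefixes are Laravel's default layout rather than our exact paths, the 60% fallback reuses the original Skill C threshold for anything unlisted, and the file names and coverage figures are made up for illustration.

```php
<?php
declare(strict_types=1);

// Percentages come from the paragraph above. The path prefixes are assumed
// Laravel defaults, and the fallback is the original global 60% threshold.
const COVERAGE_THRESHOLDS = [
    'app/Http/Controllers/' => 80.0,
    'app/Jobs/' => 70.0,
    'app/Mail/' => 30.0,
];
const DEFAULT_THRESHOLD = 60.0;

/** Returns the minimum line coverage required before a file may be edited. */
function thresholdFor(string $path): float
{
    foreach (COVERAGE_THRESHOLDS as $prefix => $threshold) {
        if (str_starts_with($path, $prefix)) {
            return $threshold;
        }
    }
    return DEFAULT_THRESHOLD;
}

/**
 * @param array<string, float> $lineCoverage file path => line coverage in percent
 * @return list<string> files that need characterization tests before any edit
 */
function filesNeedingCharacterization(array $lineCoverage): array
{
    $needsTests = [];
    foreach ($lineCoverage as $path => $percent) {
        if ($percent < thresholdFor($path)) {
            $needsTests[] = $path;
        }
    }
    return $needsTests;
}

// Hypothetical files and coverage figures.
$needsTests = filesNeedingCharacterization([
    'app/Http/Controllers/RefundController.php' => 72.5, // below 80
    'app/Jobs/ProcessStripeWebhook.php' => 71.0,         // above 70
    'app/Mail/InvoiceMail.php' => 35.0,                  // above 30
    'app/Services/ReportBuilder.php' => 41.0,            // below the 60 fallback
]);
```

With these figures the controller and the service class are routed to characterization tests first, while the job and the mailer go straight to the change.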

When we built the AI agents for a Dutch logistics client running a Laravel 10 monolith of similar shape, we ran into Skill B's silent-pass mode on a queue retry handler. We solved it by adding a pre-test step that asserts the test actually fails when the production code path is commented out, before the agent is allowed to write the fix.
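The idea behind that pre-test step can be sketched in a few lines. This is a toy model, not our runner: a test is a closure that receives the handler under test and returns true on pass, an ArrayObject plays the database, and the two example tests are hypothetical reconstructions modelled on the seeded webhook test from earlier.

```php
<?php
declare(strict_types=1);

/**
 * A test only exercises a code path if it fails once that path is disabled.
 * $test receives the handler under test and returns true when it passes.
 */
function testExercisesPath(callable $test, callable $disabledHandler): bool
{
    return $test($disabledHandler) === false;
}

// The production path "commented out": the handler does nothing at all.
$disabled = function (ArrayObject $payments, string $chargeId): void {
};

// Modelled on the seeded webhook test: one payment already exists, so the
// count is 1 no matter what the handler does.
$seededTest = function (callable $handler): bool {
    $payments = new ArrayObject(['ch_1' => true]); // seed row
    $handler($payments, 'ch_1');
    $handler($payments, 'ch_1');
    return count($payments) === 1;
};

// Same assertion from an empty store: the count can only reach 1 if the
// handler actually records the payment.
$freshTest = function (callable $handler): bool {
    $payments = new ArrayObject();
    $handler($payments, 'ch_1');
    $handler($payments, 'ch_1');
    return count($payments) === 1;
};

foreach (['seeded' => $seededTest, 'fresh' => $freshTest] as $name => $test) {
    $verdict = testExercisesPath($test, $disabled) ? 'exercises the path' : 'REJECT: passes with the path disabled';
    echo "{$name}: {$verdict}\n";
}
```

The seeded test is rejected because it stays green with the handler disabled, which is exactly the silent pass described above. The fresh-store variant goes red, so the agent is allowed to proceed.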

The five-minute audit you can run today: open the last ten PRs your team merged that included new tests. For each one, comment out the line of production code the test was meant to cover, rerun the suite, and count how many of those tests still pass. If more than two do, your TDD culture is shaped like Skill B. The fix is at the skill layer, not the model layer: force the agent to verify the test fails for the right reason.

FAQ

Which TDD skill works best for Laravel codebases over 50k lines?

Characterization-first wins in our tests, because it forces the agent to pin down current behaviour before touching legacy code. The other two pass tests for the wrong reasons too often.

Why do test-first agent skills fail in legacy codebases?

Factories and fixtures often lie about the state they create. The agent treats a failing test as a spec, but the spec is wrong, so it writes code that passes the wrong test.

Can guard-based TDD skills prevent agents from cheating?

Not reliably. A guard only checks that a test exists and fails first. If the test passes for an unrelated reason after the patch lands, the guard cannot tell the difference.

How do I audit my team's existing test culture?

Pick ten recent PRs, comment out the production code each new test was meant to cover, and rerun the suite. If more than two tests still pass, your TDD is shape over substance.

Do these failure modes go away with a smarter model?

No. They are structural. A smarter model writes the wrong test faster. The fix is at the skill layer, not the model layer: force the agent to verify the test fails for the right reason.
