Security

AI vulnerability scanning: a legacy PHP refactor pre-flight

Before a single line of refactor, we pointed an open-source AI vulnerability scanner at 60,000 lines of legacy PHP. Here is the toolchain and the punch list it produced.

Jacob Molkenboer· Founder · A Brand New Company· 28 May 2024· 8 min

Closed leather logbook, brass magnifier on cream card, green silk ribbon, tarnished key and iron tag on ivory paper.

The codebase landed on our shared drive on a Thursday evening. A zip of 60,412 lines of PHP, written in fragments between 2011 and 2024, four different coding conventions, two ORMs (one homegrown, no documentation), and a directory called /admin/ with a file called do_things.php. The client wanted to migrate it onto Laravel by autumn. We told them to wait until we had run a vulnerability scan over the whole thing.

Before any refactor, we wanted a baseline. Not a tidy list of typescript-style warnings. A real inventory of what the application gets wrong when it touches user input, file paths, and the database. The kind of map that tells you which parts of the codebase you must not touch until you have backups, and which parts are quietly bleeding right now.

Scan first, refactor second

Most legacy migration projects do this in the wrong order. The team picks a target framework, schedules a rewrite, and finds the security holes mid-translation, half of them surviving into the new system because the developer copied a query verbatim "to match the old behaviour." By then the rewrite is the only place where the bug exists, and nobody knows when it landed.

Running the scan first solves three things at once. You get a punch list of issues that need a hotfix before anyone touches the live system. You get a vocabulary for the rewrite: every PR can say "fixes vulnerability VL-014, mapped from legacy/users.php:142". And you find the parts of the old code that were never as bad as the team assumed, which saves rewrite time you would have spent re-implementing the wrong abstraction.

The four-tool baseline

We do not run a single scanner. Every tool catches a different shape of bug, and the overlap is where you get confidence. Our default stack on a PHP project looks like this:

Semgrep with the p/php and p/security-audit rulesets, plus a handful of project-specific rules.
Psalm with --taint-analysis enabled, which traces user input from source to sink across the type graph.
PHPStan at level 6, mostly for the type-safety findings that surface dead branches and missing null checks.
An open-source AI vulnerability scanner. The category is young, but the mature ones reason across function boundaries instead of matching on local patterns, which is where the interesting bugs live.

The first three are pattern matchers and type checkers. They are fast, predictable, and they miss anything that requires reasoning across more than one or two files. The fourth, the AI scanner, picks up multi-step vulnerabilities where an attacker controls one variable, it flows through three function boundaries, and ends up in a sink nobody flagged because nobody wrote the rule.

Running all four in one pass

We wrap the scanners in a small Bash script that writes structured output to disk. Each tool emits JSON or SARIF, which keeps the downstream triage simple.

#!/usr/bin/env bash
set -euo pipefail

REPO="${1:-./legacy-app}"
OUT="./scan-output/$(date +%Y%m%d-%H%M)"
mkdir -p "$OUT"

# 1. Semgrep
semgrep --config p/php --config p/security-audit \
        --json --output "$OUT/semgrep.json" "$REPO"

# 2. Psalm with taint analysis
( cd "$REPO" && psalm --taint-analysis --report=psalm.sarif ) \
  && mv "$REPO/psalm.sarif" "$OUT/psalm.sarif"

# 3. PHPStan
( cd "$REPO" && phpstan analyse --error-format=json --level=6 ) \
  > "$OUT/phpstan.json" || true

# 4. AI vulnerability scanner
ai-vuln-scan --target "$REPO" --json > "$OUT/ai-scan.json"

echo "Done. Findings written to $OUT"

On the 60k-line codebase, the full sweep took 38 minutes on an M2 Pro. The AI scanner dominated the runtime at 31 of those minutes, which is expected: it reads files end to end rather than matching on patterns.

Deduplicating the noise

The raw output was loud. Semgrep flagged 198 findings, Psalm 89, PHPStan 1,407 (most of them type-safety issues unrelated to security), and the AI scanner 312. Many of these pointed at the same code with different vocabulary. Our triage script normalises everything to (file, line, category) tuples and groups them.

import json, collections

def load(path, mapper):
    with open(path) as f:
        return [mapper(x) for x in json.load(f)]

# Each mapper returns (file, line, category, severity, source)
findings  = load("semgrep.json",  semgrep_map)
findings += load("psalm.sarif",   psalm_map)
findings += load("ai-scan.json",  ai_map)

groups = collections.defaultdict(list)
for f, l, cat, sev, src in findings:
    groups[(f, l, cat)].append((sev, src))

# A finding confirmed by two or more tools jumps to the top of triage.
confirmed = {k: v for k, v in groups.items() if len(v) >= 2}
solo_ai   = {k: v for k, v in groups.items()
             if len(v) == 1 and v[0][1] == "ai-scan"}

"Confirmed by two or more tools" is not a guarantee of a real bug, but it is a strong prior. On this project, 71 findings landed in confirmed. Another 184 were AI-only, of which roughly half held up after manual review, mostly multi-file flows the classic tools could not see.

Warning

AI scanners hallucinate code that isn't there. We treat every AI-only finding as a hypothesis, not a verdict. Two minutes of grep against the cited file saves hours of chasing ghosts.

What survived triage

After two days of review, the punch list looked like this. The categories are the boring ones from the OWASP Top 10, which is exactly what you expect from a codebase that grew over thirteen years without a security review.

SQL injection in 7 places. All of them string-concatenated WHERE clauses inside admin tools. One of them was reachable from a public route via a forgotten ?action=export handler.
Stored XSS in 23 templates. Direct echo $row['title'] with no escaping. Most were in low-traffic admin pages, but the customer-facing review page was one of them.
RCE via include $_GET['page']. Unauthenticated path. The file had been added in 2014 "as a temporary debug helper." It outlived the developer who wrote it.
A hardcoded Mailgun API token sitting in config.inc.php, committed in 2017, still valid, still in the live config. The token had send permissions on the domain.
Session fixation in the login flow. No session_regenerate_id() after authentication. Trivial to weaponise once you have any XSS on the same domain, which we did.
Mass-assignment in the homegrown ORM, where a single helper copied every $_POST key onto the model object. The users model had an is_admin column.

The hardcoded token was the kind of finding that the AI scanner caught and the classic tools shrugged at. Semgrep has rules for it, but only if the token matches a known regex format. The Mailgun key had been rotated to a custom prefix years ago and slipped past every pattern matcher we tried.

Patching before the rewrite

We hotfixed the critical six in the legacy codebase before opening a single Laravel file. The migration was going to take eight weeks. The RCE and the SQLi could not wait eight weeks. The fixes were ugly (parameterised queries shoved into a code style that did not expect them) but they were correct, and they gave us a safe baseline to rewrite from.

Every hotfix went into a tracked branch with a one-line note: VL-002: SQLi in legacy/admin/export.php. Fixed via prepared statement. Re-implement properly in UserRepository. When the Laravel side reached that area weeks later, the PR description picked the note back up and closed the loop.

How the AI scanner changed our triage

The classic tools are still doing most of the work. Semgrep finds the patterns, Psalm traces the types, PHPStan keeps the codebase honest. What the AI scanner added was something none of the others could give us: a description of why a finding was dangerous, in prose, with the data flow spelled out. That speeds up triage enormously. Instead of opening four files to understand whether a finding is real, you read two paragraphs and confirm against one file.

The trade-off is honesty. Roughly one in five AI-only findings was a confident hallucination. The scanner described a function that did not exist, or a flow that ended at the wrong sink. Treat the output the same way you would treat an enthusiastic junior with no commit access: useful, motivated, occasionally wrong about the codebase they just read.

The five-minute version

If you cannot run the full stack today, pick the one file in your codebase you would not want a stranger reading carefully. Run semgrep --config p/security-audit path/to/file.php and read the top three findings before lunch. If any of them surprise you, the rest of the codebase is going to surprise you more. That is the moment to schedule the real audit.

When we built the pre-migration audit for a Dutch SaaS client running on PHP 5.6, the thing we ran into was that the static scanners and the AI scanner disagreed on close to 40% of findings. We ended up writing a small reachability filter that ranks issues by how many hops they sit from a public route, which cuts the noise in half. That kind of pre-flight is now part of every legacy migration we take on.

Key takeaway

Run the security scan before the refactor, not after. Every vulnerability you find first becomes a checklist item; every one you find during the rewrite becomes a defect that ships.

FAQ

Which scanner should I start with on a legacy PHP codebase?

Semgrep with the p/security-audit ruleset, because it runs in seconds and surfaces the obvious wins. Once you have a baseline, add Psalm with taint analysis and an AI scanner on top.

Can an AI vulnerability scanner replace traditional static analysis?

Not yet. AI tools catch multi-file flows that pattern matchers miss, but they hallucinate. Run both, and trust findings that two independent tools agree on.

When should you fix vulnerabilities in legacy code instead of waiting for the rewrite?

Anything reachable from a public route gets a hotfix in the legacy codebase first. A rewrite takes weeks. SQL injection and RCE cannot wait that long.

How long does a full security scan take on a 60k-line PHP codebase?

Around 40 minutes end to end on a modern laptop. Semgrep and PHPStan finish in seconds. Psalm with taint analysis takes a few minutes. The AI scanner dominates the runtime.

securityphplegacy sitesmigrationtoolingai agents

Building something?

Start a project