AI vulnerability discovery in CI: a legacy PHP playbook

A 200k-line PHP codebase, a CI pipeline that still works, and Anthropic's new vulnerability-discovery framework. Here is how we wired all three together in one afternoon.

Jacob Molkenboer · Founder · A Brand New Company · 5 Jun 2026 · 9 min

Wednesday afternoon, 14:00. A logistics company in Rotterdam sends us a Git URL: 200,000 lines of PHP, untouched since 2021, running their entire order pipeline. A junior developer had pointed a generic static-analysis tool at it last month and produced a 3,200-finding PDF that nobody opened. The security committee meets at 18:00 and wants a real answer.

We had four hours.

The brief, in one sketch

The week before, Anthropic's open-source framework for AI-powered vulnerability discovery had hit the front page of Hacker News with several hundred points and a long, sceptical thread. The interesting argument in that thread, roughly, was that the LLM-as-reviewer pattern had finally crossed the line from "demo" to "useful inside a build pipeline." We wanted to test that claim against a codebase nobody on the client's team enjoyed touching.

The brief was small. By the end of the afternoon, every pull request on this repository should run an agent-driven scan, fail the build on anything high severity, and surface findings inside GitHub's code-scanning view. No new dashboards. No new accounts. No Slack bots. The team already had alert fatigue.

Why a PHP monolith is a forgiving target

PHP monoliths get a bad reputation, and not always unfairly. But for an LLM reviewer they are friendly in three ways. The control flow is mostly local: a request enters one front controller, runs through a handful of includes, and returns. There is no async runtime and no metaprogramming layer thick enough to confuse a model. And the historical bug classes are well-documented. SQL injection through string concatenation. XSS in echo statements. unserialize() on user input. Weak crypto in legacy session handlers. A model trained on the public web has seen every variant of these OWASP Top 10 staples.

The bad news: a 200k-line monolith likely has every variant several times over. You cannot ask one agent to read the whole codebase in a single pass without burning the client's quarterly token budget. You need to chunk, and you need to be honest about the chunking strategy.
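One way to be explicit about the chunking strategy is to batch whole files under a fixed line budget, so no file is ever split mid-function. The sketch below is a minimal illustration in Python, not the framework's own chunker; `chunk_files`, `count_php_lines` and the 4,000-line budget are hypothetical names and values chosen for the example.

```python
from pathlib import Path


def chunk_files(line_counts, budget=4000):
    """Group (path, line_count) pairs into chunks of at most `budget` lines.

    Files are kept whole. A file larger than the budget gets a chunk of its own.
    Sorting by path keeps files from the same directory in neighbouring chunks.
    """
    chunks, current, used = [], [], 0
    for path, lines in sorted(line_counts):
        if current and used + lines > budget:
            chunks.append(current)
            current, used = [], 0
        current.append(path)
        used += lines
    if current:
        chunks.append(current)
    return chunks


def count_php_lines(root):
    """Return (path, line_count) for every .php file under `root`."""
    result = []
    for p in Path(root).rglob("*.php"):
        with open(p, "rb") as fh:
            result.append((str(p), sum(1 for _ in fh)))
    return result


if __name__ == "__main__":
    # "app" is an assumed directory name; point this at your own source root.
    for i, chunk in enumerate(chunk_files(count_php_lines("app"))):
        print(i, len(chunk), "files")
```

Whatever rule you pick, write it down: a finding that depends on two files in different chunks is one the scan may never see.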

The half-day plan

We split the afternoon into three passes.

Pass one was a local dry-run on a developer laptop. The goal was to read the noise floor before involving CI at all.

Pass two was a CI job that ran on every pull request, but only against the diff plus a narrow context window. This is the gate that protects future work.

Pass three was a nightly full-codebase sweep, queued so it never blocked human work, with the results dumped into GitHub's code-scanning view.

Total framework invocations across the afternoon: about 140. Total spend: under €60. Where the junior developer's PDF had listed 3,200 findings, ninety minutes of triage on the framework's output left 38 confirmed ones, eleven of which we patched the same evening.

Pass one: baseline on a laptop

Before any pipeline work, clone the repo, install the framework, run it against one directory you understand well, and read every finding by hand. You are doing this to calibrate the model, not to ship results.

git clone git@github.com:client/orderflow.git
cd orderflow

# install per the framework's README; we pinned to a tagged release
export ANTHROPIC_API_KEY=sk-ant-...

./bin/vulnscan scan \
  --path app/Catalog \
  --severity high \
  --format sarif \
  --out catalog-baseline.sarif

Open catalog-baseline.sarif in any SARIF viewer (VS Code has a decent one) and read every finding. Three categories will appear. Real bugs you patch immediately. False positives you note for suppression. And "interesting but not exploitable" notes that go in a third file for the next pentest. If more than 40% of findings are false positives at your chosen severity, raise the threshold before going further. We landed at about 18% false-positive rate at high severity on this codebase, which we judged acceptable.
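The 40% rule is simple arithmetic once every finding carries a triage label. A minimal sketch, assuming you record one label per finding while reading; the label names and the counts in the usage note are made up for illustration.

```python
def false_positive_rate(labels):
    """Share of triaged findings labelled 'false_positive'."""
    if not labels:
        return 0.0
    false_positives = sum(1 for label in labels if label == "false_positive")
    return false_positives / len(labels)


def should_raise_threshold(labels, limit=0.40):
    """True when the false-positive share exceeds the limit."""
    return false_positive_rate(labels) > limit
```

For example, 9 false positives in 50 hand-read findings gives a rate of 0.18, below the limit, so the severity threshold stays where it is.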

Warning

Always set a hard cost ceiling at the pipeline level, not just inside the framework config. A buggy retry loop with no upper bound can drain a monthly token budget in under an hour. We use both: the framework's own --max-cost flag, plus a GitHub Actions timeout-minutes that aborts the nightly job after four hours of runtime and PR jobs after 30 minutes.

Pass two: CI on diffs only

The full repo had 200,000 lines. Sending all of it to the model on every pull request would have melted the budget within a week. So the PR job only scans the diff plus a small context window of surrounding files.

# .github/workflows/vulnscan-pr.yml
name: vulnscan (PR)
on:
  pull_request:
    branches: [main]

jobs:
  scan:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    permissions:
      contents: read
      pull-requests: write
      security-events: write
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

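      - name: Compute changed PHP files
        id: diff
        env:
          BASE_REF: ${{ github.base_ref }}
        run: |
          git fetch origin "$BASE_REF"
          # ACMR skips deleted files, which no longer exist to scan
          CHANGED=$(git diff --name-only --diff-filter=ACMR \
            "origin/$BASE_REF...HEAD" -- '*.php' \
            | tr '\n' ' ')
          echo "files=$CHANGED" >> "$GITHUB_OUTPUT"

      - name: Run vulnscan on diff
        if: steps.diff.outputs.files != ''
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          # file names arrive via env, not inline expression interpolation,
          # so a hostile file name cannot inject shell into the script
          CHANGED_FILES: ${{ steps.diff.outputs.files }}
        run: |
          ./bin/vulnscan scan \
            --files "$CHANGED_FILES" \
            --context-radius 2 \
            --severity high \
            --format sarif \
            --out vulnscan.sarif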

      - name: Upload SARIF to code scanning
        # always(): still upload when the scan step has failed the build
        if: always() && steps.diff.outputs.files != ''
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: vulnscan.sarif

Two notes on the YAML.

First, context-radius is the flag we cared about most. The idea: send the model the changed function plus N files up the call graph, so it can see how a parameter arrives. If your framework does not expose this, fake it by passing the diff plus any file matching the changed class name. Context is the difference between catching $_GET['id'] entering a query four files later and missing it entirely.
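The fallback described above (the diff plus any file that mentions a changed class) can be approximated with a text search. This is a rough sketch rather than a call-graph walk: `context_files` and `declared_classes` are hypothetical helpers, and the regex only catches plain `class Foo` declarations at the start of a line.

```python
import re
from pathlib import Path

# Matches "class Foo", "final class Foo", "abstract class Foo" at line start.
CLASS_RE = re.compile(r"^\s*(?:abstract\s+|final\s+)*class\s+(\w+)", re.MULTILINE)


def declared_classes(source):
    """Class names declared in a PHP source string."""
    return set(CLASS_RE.findall(source))


def context_files(changed, root="."):
    """Changed files plus every other .php file mentioning one of their classes.

    Paths in `changed` are relative to `root`; so are the returned paths.
    """
    changed = {str(Path(p)) for p in changed}
    names = set()
    for path in changed:
        names |= declared_classes(Path(root, path).read_text(errors="ignore"))
    if not names:
        return sorted(changed)
    pattern = r"\b(?:" + "|".join(re.escape(n) for n in sorted(names)) + r")\b"
    mention = re.compile(pattern)
    extra = set()
    for p in Path(root).rglob("*.php"):
        rel = str(p.relative_to(root))
        if rel not in changed and mention.search(p.read_text(errors="ignore")):
            extra.add(rel)
    return sorted(changed | extra)
```

A text match over-includes (comments and strings count as mentions), which is the right direction to err in here: extra context costs tokens, missing context costs findings.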

Second, uploading SARIF to GitHub's code-scanning view means you do not need a new dashboard. Findings appear next to Dependabot alerts and CodeQL results. The team already knows where to look.

Pass three: the nightly sweep

# .github/workflows/vulnscan-nightly.yml
name: vulnscan (nightly)
on:
  schedule:
    - cron: '0 2 * * *'  # 02:00 UTC
  workflow_dispatch:

jobs:
  full-scan:
    runs-on: ubuntu-latest
    timeout-minutes: 240
    permissions:
      contents: read
      security-events: write
    steps:
      - uses: actions/checkout@v4

      - name: Full-codebase scan
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          ./bin/vulnscan scan \
            --path . \
            --severity medium \
            --chunk-size 4000 \
            --max-cost 25 \
            --format sarif \
            --out vulnscan-full.sarif

      - uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: vulnscan-full.sarif
          category: vulnscan-full

The --max-cost flag is the one to argue about with finance. We capped it at €25 per nightly run, which is enough to sweep 200,000 lines in one pass and low enough that a runaway loop cannot silently spend €4,000 by morning. If your framework does not expose a cost ceiling, write a wrapper that watches token spend and kills the process when it crosses your threshold.
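If you do need such a wrapper, it can be short. The sketch below assumes, hypothetically, that the scanner prints one JSON object per model call on stdout with `input_tokens` and `output_tokens` fields; that log format, the function name `run_with_ceiling` and the price parameters are all assumptions for illustration, so adapt the parsing to whatever your tool actually emits.

```python
import json
import subprocess


def run_with_ceiling(cmd, ceiling, price_in, price_out):
    """Run `cmd`, add up token cost from its stdout, kill it past `ceiling`.

    Assumed log format: one JSON object per line with "input_tokens" and
    "output_tokens". Prices are per million tokens, in the same currency
    as `ceiling`. Returns (total_cost, killed).
    """
    spent = 0.0
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    try:
        for line in proc.stdout:
            try:
                usage = json.loads(line)
            except ValueError:
                continue  # ordinary log output, not a usage record
            if not isinstance(usage, dict):
                continue
            spent += usage.get("input_tokens", 0) * price_in / 1_000_000
            spent += usage.get("output_tokens", 0) * price_out / 1_000_000
            if spent > ceiling:
                proc.kill()  # stop the scan as soon as the ceiling is crossed
                return spent, True
        return spent, False
    finally:
        proc.stdout.close()
        proc.wait()
```

The wrapper only sees spend the tool reports, so it complements the job-level timeout rather than replacing it.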

Triage that the team will actually do

The first nightly run produced 312 findings at medium-or-above severity. Same problem the junior developer had with the PDF: nobody opens 312 findings on a Tuesday morning.

We did three things.

One, we wrote a small jq filter that grouped findings by file and ranked files by total severity. The team triaged top files, not top findings. This collapsed 312 individual entries into 47 file-level groups, of which 18 had any high-severity content and got read first.

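jq -r '
  # Illustrative weights; unknown or missing labels score 0.
  # Assumes lowercase labels in .properties.severity.
  def weight: {"critical": 8, "high": 4, "medium": 2, "low": 1}[. // ""] // 0;
  .runs[0].results
  | group_by(.locations[0].physicalLocation.artifactLocation.uri)
  | map({
      file: .[0].locations[0].physicalLocation.artifactLocation.uri,
      count: length,
      severities: (map(.properties.severity) | unique),
      score: (map(.properties.severity | weight) | add)
    })
  | sort_by(-.score)
' vulnscan-full.sarif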

Two, we added a .vulnscan-suppress.yml file at the repo root listing accepted findings by rule id and file path. After triage, anything still in the SARIF that was not on the suppress list became a build failure on the next PR touching the file. Anything on the suppress list still appeared in code scanning but did not block work.

# .vulnscan-suppress.yml
suppressions:
  - rule: weak-rand-session
    path: legacy/auth/session_legacy.php
    reason: scheduled for removal in Q3 migration; ticket SEC-441
    expires: 2026-09-30
  - rule: php-unserialize
    path: tools/import/csv_legacy_import.php
    reason: internal CLI only, no network exposure
    expires: 2027-01-01

The expires field matters more than the reason. A suppression without an expiry is a permanent excuse. A suppression with a date is a calendar invite.
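To make the expiry binding, the check that reads the suppress list has to treat an expired or undated entry as no entry at all. A minimal sketch of that rule, assuming the YAML has already been parsed into a list of dicts and each finding is reduced to a rule id and a path; `is_suppressed` is a hypothetical helper, not part of the framework.

```python
from datetime import date


def is_suppressed(finding, suppressions, today=None):
    """True if an unexpired suppression matches the finding's rule and path.

    Entries without an `expires` date never suppress anything.
    """
    today = today or date.today()
    for entry in suppressions:
        expires = entry.get("expires")
        if isinstance(expires, str):
            # YAML loaders may hand back either a date or an ISO string
            expires = date.fromisoformat(expires)
        if (
            expires is not None
            and entry["rule"] == finding["rule"]
            and entry["path"] == finding["path"]
            and today <= expires
        ):
            return True
    return False
```

With the first entry above, a `weak-rand-session` finding in `legacy/auth/session_legacy.php` is suppressed through 2026-09-30 and blocks again from 2026-10-01.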

Three, the high-severity gate stayed strict in one direction only: a new high-severity finding introduced by a pull request fails the check. A pre-existing high-severity finding from the nightly does not fail PRs that do not touch the file, because that bug existed yesterday and blaming today's PR for it is how teams learn to disable the gate.
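The one-directional gate comes down to a set comparison between the PR scan and the last nightly baseline. The sketch below is an assumption about how such a gate could be written, not the framework's behaviour: the finding fields, the `(rule, path)` fingerprint and `blocking_findings` are hypothetical, and suppressions are left out for brevity.

```python
def fingerprint(finding):
    """Coarse identity that survives line shifts: rule id plus file path."""
    return (finding["rule"], finding["path"])


def blocking_findings(pr_findings, baseline_findings, touched_files):
    """High-severity findings that should fail this pull request.

    A finding blocks if it is new relative to the nightly baseline, or if it
    is pre-existing but sits in a file the pull request touches.
    """
    known = {fingerprint(f) for f in baseline_findings}
    touched = set(touched_files)
    blocking = []
    for finding in pr_findings:
        if finding["severity"] != "high":
            continue
        is_new = fingerprint(finding) not in known
        if is_new or finding["path"] in touched:
            blocking.append(finding)
    return blocking
```

The fingerprint is deliberately coarse: a second instance of the same rule in the same file will not register as new, so a stricter gate would add a code snippet hash to the tuple.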

What stays human

We did not let the framework approve anything in three areas.

Authentication and session handling. The model is decent at spotting weak session tokens or missing CSRF checks. The cost of a false negative here is whole-account takeover. Two human reviewers on any auth change, regardless of what the agent says.

Anything touching payment flow. This is a regulatory call as much as a technical one. The PSD2 audit trail expects named reviewers, and an LLM cannot be one.

Business-logic authorization. The model knows that calling is_admin() is good. It does not know that is_admin() returns true for any user whose ID is divisible by seven because of a 2018 marketing campaign nobody has the courage to remove. Domain quirks need domain reviewers.

For everything else (the classical web vulnerabilities, deserialization, file inclusion, command injection), the framework caught real bugs the team had missed for years. The first night's 38 confirmed findings included three SQL injections in product search, an unserialize() on a cookie value, and a path-traversal bug in a CSV-download endpoint that had been live since 2017. None required clever prompting. The model read the code, wrote it up, and pointed at the offending line.

The honest part

This does not replace a pentest. What it replaces is the gap between pentests. A serious quarterly review still needs humans who can chain three bugs into a real attack and who understand the business well enough to know which attack matters. But the noise that humans waste their first day on (the obvious string-concat SQL, the obvious unescaped echo, the obvious unserialize) belongs to the pipeline now. Pentesters then start from day two, which is where they were always more interesting anyway.

When we wired this up for the Rotterdam logistics team, the friction was not the framework itself. It was getting GitHub Actions to play nicely with SARIF upload on a private repo (you need security-events: write and either GitHub Advanced Security or public visibility for the code-scanning UI to render), and writing the suppression file in a way the team would maintain instead of resent. We do this kind of AI-into-CI work for clients with PHP, Laravel and Magento monoliths; the half-day version is real and we have run it on six codebases now.

If you have a legacy PHP codebase and a CI you trust, the smallest useful thing to do today is to run the framework against one directory you understand well, on your laptop, by hand, before touching the pipeline. Read every finding. The noise floor you measure in that first run is the only honest input to every decision that follows.

Key takeaway

The pipeline catches the bug classes humans waste their first day on. Pentesters then start on day two, which is when they were always more useful.

FAQ

Does this replace a pentest?

No. It catches the obvious bug classes (SQL injection, XSS, deserialization, path traversal) so human pentesters start work on day two of an engagement instead of day one. The interesting attacks still need a human.

What does the CI bill look like for a 200k-line repo?

Diff-only PR runs land between €0.20 and €0.40 each. The €25 ceiling on the nightly full sweep is a cap rather than the expected spend, and it is enough for 200k lines in one pass. Total monthly cost on an active repo: €60 to €120.

Does this work on private GitHub repositories?

Yes, but rendering findings inside GitHub's code-scanning UI on a private repo needs GitHub Advanced Security, a paid add-on. Without it, expect GitHub to reject the SARIF upload; keep the file as a build artifact instead.

What about Laravel, Symfony or Magento, not raw PHP?

Same playbook. Framework-aware findings work because the model recognises Eloquent query builders, Symfony route attributes and Magento controllers. Adjust the directory list in pass one to your framework's structure.
