AVG-retrofit for AI agents: the 03:00 audit checklist

It's 03:14 on a Tuesday. A CTO outside Eindhoven gets an email from the Autoriteit Persoonsgegevens. Subject: artikel 30, deadline 72 hours. Three agents in scope.

Jacob Molkenboer · Founder · A Brand New Company · 22 Jun 2026 · 9 min

It's 03:14 on a Tuesday. The CTO of a 38-person logistics planner outside Eindhoven has received an email from the Autoriteit Persoonsgegevens. Subject: verzoek om informatie verwerkingsregister, artikel 30 AVG. Deadline: 72 hours. Three of their AI agents — the one that drafts customer replies, the one that summarises sales calls, and the one that classifies expense receipts — process personal data. None of them appear in the register by name.

She opens the laptop. She starts a checklist.

This is the checklist.

The 03:00 question, restated

If an inspector asked you to prove which personal data your agents touched in the last 24 hours, which prompt-pad (prompt path) it travelled along, and which sub-processor saw it, could you answer in one query?

For most sub-€14M Dutch SMEs we audit, the answer is no. Not because they are sloppy. Because the agents shipped fast, the verwerkingsregister (the Art. 30 record of processing activities) was written before the agents existed, and nobody updated that register when a LangChain minor bump quietly added a third sub-processor. The checklist below is the one we run before we quote an AVG-retrofit. It scores three things: DPIA coverage on the top 20 prompt-paden, sub-processor disclosure on the top 8 vendor APIs, and which agents would survive an AP onderzoek (investigation) without losing the trail at 03:00.

Scoring the top 20 prompt-paden

A prompt-pad is one trajectory a user input takes through your agent: from input handler, through tool calls, to the final response. Most teams underestimate how many distinct paden they have. A single inbox-agent with five tools and three branching prompts routinely sits north of 40 paden in production. We score the top 20 by traffic volume; the long tail can wait for round two.

For each pad we record five rows:

  • Does the input contain a category of personal data under Art. 4(1) AVG? Name, email, IBAN, BSN, photo, voice, health, location — yes/no per category.
  • Is there a DPIA on file that covers this exact pad, or only the agent in general?
  • Does the pad include a free-text field where the user might paste a third party's data? This is the hidden special-category (bijzondere categorie) problem.
  • Does the agent persist the prompt or response outside the model provider — vector store, observability tool, log aggregator?
  • Is there a retention timer on each persistence target, or does it grow forever?

Each row scores 0–3. Anything under 12/15 fails. Failed paden go on a one-page remediation table with named owners and dates.
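The scoring rule above is small enough to sketch in code. The following is a minimal illustration, not our audit tooling: the row names in `ROWS` and the `PadScore` class are assumptions made up for this example; only the 0–3 scale and the 12/15 pass mark come from the checklist.

```python
from dataclasses import dataclass

# Assumed short names for the five checklist rows (hypothetical labels).
ROWS = (
    "personal_data",
    "dpia_coverage",
    "free_text_risk",
    "persistence",
    "retention_timer",
)
PASS_MARK = 12  # out of 15; anything lower fails


@dataclass
class PadScore:
    pad_id: str
    scores: dict  # row name -> integer score from 0 to 3

    def total(self) -> int:
        # Refuse to score a pad with missing or unknown rows.
        if set(self.scores) != set(ROWS):
            raise ValueError(f"{self.pad_id}: expected exactly the rows {ROWS}")
        if any(not 0 <= value <= 3 for value in self.scores.values()):
            raise ValueError(f"{self.pad_id}: each row scores 0-3")
        return sum(self.scores.values())

    def passes(self) -> bool:
        return self.total() >= PASS_MARK


def remediation_table(pads):
    """Return the failed paden, worst score first."""
    failed = [pad for pad in pads if not pad.passes()]
    return sorted(failed, key=lambda pad: pad.total())
```

Named owners and dates are deliberately left out of the sketch; they belong on the one-page remediation table, not in the scoring function.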

The most common gap we find: the DPIA covers the agent as a product, but not the specific pad where a salesperson can paste a customer's CV into the chat to ask "draft me a polite rejection." That paste is a separate processing activity, and one that needs its own assessment under Art. 35 AVG. It does not inherit coverage from the agent's product-level DPIA, and the AP reads it as a CV-screening processing the SME never declared.

Sub-processor disclosure on the top 8 vendor APIs

Most agent stacks we audit touch eight external APIs: the LLM provider, the vector DB, the observability tool, the evaluation tool, the function-calling middleware, the speech-to-text service (for voice agents), the search provider, and the email sender. Each one is a sub-processor under Art. 28(2) AVG. Each one needs to be listed by name in the verwerkingsregister, with the data categories it processes and a working verwerkersovereenkomst (data processing agreement).

Three things people miss:

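  • The sub-processor's sub-processor. Your LLM vendor runs on someone else's cloud, and the same model can reach you through more than one route: Claude models are also offered through Amazon Bedrock, OpenAI models through Azure. Your DPA chain has to name the route you actually use, including region. The data does not stop at the API name on your invoice.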
  • The version pin. A LangChain minor bump in 2025 added a new telemetry endpoint that quietly forwarded request metadata to a third party. We caught variants of this on three audits in a row. If your verwerkingsregister does not pin the version, or you have no upgrade-review process, you are disclosing a vendor list that is already out of date.
  • The vendor's own auth-flow changes. Anthropic announced ID verification for certain API capabilities starting 8 July 2026. That changes who can call which model, and for agents that surface those flows to end-users it shifts both the legal-basis description and the controller/processor boundary.

We maintain an 8-row vendor matrix per client. Vendor, role under AVG, sub-processor location, data categories sent, processor agreement URL, last-reviewed date, escalation contact, version pinned in production. If any cell is blank, the audit fails on that line. No partial credit. The matrix lives next to the verwerkingsregister, not inside Notion where nobody reviews it.
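The "no blank cells" rule is easy to enforce mechanically. Below is a rough sketch under assumptions: the `VendorRow` field names are our shorthand for the eight columns listed above, and the example values are invented.

```python
from dataclasses import dataclass, fields


@dataclass
class VendorRow:
    # One field per matrix column; the names are illustrative.
    vendor: str
    avg_role: str
    sub_processor_location: str
    data_categories: str
    processor_agreement_url: str
    last_reviewed: str
    escalation_contact: str
    version_pinned: str


def blank_cells(row: VendorRow) -> list:
    """Return the names of all empty or whitespace-only cells."""
    return [f.name for f in fields(row) if not str(getattr(row, f.name)).strip()]


def audit_matrix(rows) -> dict:
    """Map each failing vendor to its blank cells. No partial credit."""
    failures = {}
    for row in rows:
        blanks = blank_cells(row)
        if blanks:
            failures[row.vendor or "<unnamed>"] = blanks
    return failures
```

An empty result means every line passes; any entry in the returned dict is an audit failure on that vendor's line.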

The three agents that survive

After fourteen production agents and roughly forty client audits, we keep seeing the same three shapes survive an AP inspection. Everything else fails on the first or second question.

The narrow-scope agent

Does one thing — invoice chasing, lead qualification, ticket triage. Reads from one source. Writes to one source. Logs every prompt and response with a stable pseudonymised user-ID and a timestamp. Has a DPIA scoped to that one activity. No free-text "ask me anything" surface. Boring on purpose. Auditable by accident.
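A stable pseudonymised user-ID can be produced with a keyed hash. This is a minimal sketch, assuming an HMAC-SHA256 construction and a `u_` prefix; the function name and the 16-character truncation are choices made for this example, not a description of any particular client's setup.

```python
import hashlib
import hmac


def pseudonymise(user_id: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym from a user ID with HMAC-SHA256.

    The same user_id and key always give the same pseudonym, so log
    lines can be joined per user without storing the real identifier.
    """
    digest = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return "u_" + digest[:16]  # truncated for readable logs (assumption)
```

The key has to live outside the log store. If the key sits next to the logs, anyone with log access can recompute the mapping, and the ID is no longer pseudonymised in any useful sense.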

The retrieval-bounded agent

Has a RAG layer, but the retrieval index is documented as a separate processing activity in the verwerkingsregister, with its own retention policy and lawful basis. The agent cannot answer from outside the index. When the AP asks "where did this answer come from," there is a chunk-ID in the log that points to a row in the index, and that row points to the source document, and the source document is governed by the same retention timer as the chunk.
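The chain from chunk-ID to index row to source document can be sketched as two lookups plus one consistency check. The record types and field names here are hypothetical; a real index would sit in a database, not in dictionaries.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SourceDocument:
    doc_id: str
    retention_until: date


@dataclass(frozen=True)
class IndexRow:
    chunk_id: str
    doc_id: str
    retention_until: date


def trace_answer(chunk_id: str, index: dict, documents: dict):
    """Follow a logged chunk-ID back to its index row and source document.

    Raises KeyError if the trail is broken, and ValueError if the chunk
    and its source document are not on the same retention timer.
    """
    row = index[chunk_id]
    doc = documents[row.doc_id]
    if row.retention_until != doc.retention_until:
        raise ValueError(f"{chunk_id}: chunk and source retention differ")
    return row, doc
```

The retention check matters as much as the lookup: a chunk that outlives its source document is personal data with no governing policy.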

The audited-tool agent

Uses function-calling, but every tool call is logged with its arguments and return value to an append-only store with at least 12-month retention. When the AP asks "what did this agent do at 14:32 on 4 March," the answer is one query away, not a forensic exercise across three observability tools with conflicting timestamps.
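One way to get every tool call into the log is to wrap the tools rather than trust each call site. The decorator below is a sketch under assumptions: `audited` and the `sink` callable are names invented here, and the append-only guarantee has to come from the storage behind `sink`, which this code does not provide.

```python
import functools
import json
import time
from datetime import datetime, timezone


def audited(sink):
    """Wrap a tool so each call is written to `sink` as one JSON line.

    `sink` is any callable that accepts a string, for example the
    write method of an append-only log client.
    """
    def decorator(tool):
        @functools.wraps(tool)
        def wrapper(**kwargs):
            started = time.perf_counter()
            result = tool(**kwargs)
            sink(json.dumps({
                "ts": datetime.now(timezone.utc).isoformat(),
                "name": tool.__name__,
                "args": kwargs,
                "return": result,
                "duration_ms": round((time.perf_counter() - started) * 1000),
            }, default=str))  # default=str keeps non-JSON values loggable
            return result
        return wrapper
    return decorator
```

Keyword-only arguments are a deliberate restriction in this sketch: they put the argument names in the log, which is what an inspector will ask about. Tools that raise are not logged here; a production version would want a record for failures too.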

Agents that fail: the "general assistant" chat window connected to half a dozen tools, where the prompt-pad is "whatever the user types" and the log is "whatever Datadog kept on the free tier." We have migrated five of those in the last six months. None survived their initial audit. One was scoped down to three discrete agents; the others were rebuilt against a router that constrains the pad set before the model sees the input.
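To make "a router that constrains the pad set" concrete, here is a deliberately naive sketch. The keyword rules and the second pad ID are invented for illustration; a production router would more likely use a classifier, but the shape is the same: the input is mapped to a declared pad or refused before any model call.

```python
import re

# Declared paden and the keywords that select them (hypothetical rules).
PAD_RULES = {
    "pad.invoice.chase.followup": {"invoice", "chase"},
    "pad.invoice.status.lookup": {"invoice", "status"},
}


def route(user_input: str):
    """Return the ID of the first declared pad that matches, else None.

    None means the request is outside the declared pad set and should
    be refused before the model sees it.
    """
    tokens = set(re.findall(r"[a-z0-9]+", user_input.lower()))
    for pad_id, required in PAD_RULES.items():
        if required <= tokens:
            return pad_id
    return None
```

The returned ID is what ends up in the audit log, so every logged invocation maps to a pad that has been scored and covered by a DPIA.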

The verwerkingsregister-trail at 03:00

The verwerkingsregister itself is just a spreadsheet under Art. 30 AVG. What kills people at 03:00 is the trail — the audit log that proves a specific prompt at a specific time was handled in a specific way. We standardise on the following minimum, written per agent invocation:

{
  "ts": "2026-06-22T03:14:08.221Z",
  "agent_id": "invoice-chaser-v3",
  "agent_version_sha": "a91f0c2",
  "user_id_pseudonym": "u_8c1e7d...",
  "prompt_pad_id": "pad.invoice.chase.followup",
  "input_data_categories": ["name", "email", "iban"],
  "model_provider": "anthropic",
  "model_id": "claude-sonnet-4-5",
  "sub_processors_invoked": ["anthropic", "supabase-vector"],
  "tool_calls": [
    { "name": "lookup_invoice", "args_redacted": true, "duration_ms": 142 }
  ],
  "retention_until": "2027-06-22T03:14:08Z"
}

This is the minimum that lets you answer the AP's question without dragging five engineers out of bed for a forensic log export. The prompt_pad_id is the field most teams miss; without it you can prove what the model said but not which trajectory it came from, and that is the half of the answer the inspector actually cares about.
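What "one query" can look like is sketched below with SQLite from the Python standard library. The table layout and function names are assumptions for this example; the record is the JSON document shown above, stored whole, with three fields promoted to columns so they can be filtered on.

```python
import json
import sqlite3


def open_trail(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the invocation trail."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS invocations ("
        "ts TEXT NOT NULL, user_id_pseudonym TEXT NOT NULL, "
        "prompt_pad_id TEXT NOT NULL, record TEXT NOT NULL)"
    )
    return db


def write_invocation(db: sqlite3.Connection, record: dict) -> None:
    """Store one per-invocation record, keeping the full JSON alongside."""
    db.execute(
        "INSERT INTO invocations VALUES (?, ?, ?, ?)",
        (record["ts"], record["user_id_pseudonym"],
         record["prompt_pad_id"], json.dumps(record)),
    )
    db.commit()


def touched(db: sqlite3.Connection, pseudonym: str, start: str, end: str) -> list:
    """Which pad and which sub-processors handled this user in [start, end)?

    Timestamps are compared as strings, so every ts must use the same
    UTC format with the same precision.
    """
    rows = db.execute(
        "SELECT ts, prompt_pad_id, record FROM invocations "
        "WHERE user_id_pseudonym = ? AND ts >= ? AND ts < ? ORDER BY ts",
        (pseudonym, start, end),
    ).fetchall()
    return [(ts, pad, json.loads(rec)["sub_processors_invoked"])
            for ts, pad, rec in rows]
```

SQLite is only a stand-in here. The point is the shape: one table, one row per invocation, and the pad ID and pseudonym as first-class columns rather than buried in free-text log lines.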

Cloudflare's recent proposal for temporary accounts for AI agents makes this easier in practice. If every agent run boots with a scoped, expiring identity, the user_id_pseudonym and sub_processors_invoked fields can be sourced from the identity layer instead of glued together from three logs after the fact.

Warning

The Datadog free-tier default retention is 15 days. Several agents we audited last quarter had verwerkingsregister entries promising 18 months. Those do not reconcile. Either change the register or change the retention. The gap is the violation.

What the retrofit costs, in hours not euros

For a stack of three to five production agents, the retrofit lands roughly at:

  • 4–6 hours to map the top 20 prompt-paden and score them.
  • 3–4 hours to fill out the 8-vendor matrix and chase missing verwerkersovereenkomsten.
  • 8–12 hours to wire the per-invocation audit log if it does not already exist.
  • 2–3 hours to rewrite the verwerkingsregister entries so they survive a side-by-side read against the live system.

Summed, that is 17–25 hours of hands-on work. In calendar terms, budget a week for one engineer who already knows the stack, two weeks if they do not. The piece that genuinely cannot be done last-minute is the verwerkersovereenkomst chasing, because some vendors take three weeks to respond. Start there. Start today.

The closing sanity check

Before you sleep, pull one random agent invocation from the last 24 hours. Open the verwerkingsregister entry for that agent. Read the two side by side. If the log shows a data category, sub-processor, or retention window the register does not mention, you have your first remediation ticket. Repeat with two more random invocations. Three mismatches in a row means the register is fiction and the retrofit cannot wait.
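The side-by-side read can be scripted once the log has the shape shown earlier. A rough sketch follows: the register entry layout (`data_categories`, `sub_processors`, `retention_days`) is an assumption, since a real verwerkingsregister is usually a spreadsheet that would need exporting first.

```python
from datetime import datetime


def _parse(ts: str) -> datetime:
    # Accept a trailing "Z", which fromisoformat rejects before Python 3.11.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def register_mismatches(invocation: dict, entry: dict) -> list:
    """List what the log shows that the register entry does not mention."""
    findings = []
    for category in sorted(set(invocation["input_data_categories"])
                           - set(entry["data_categories"])):
        findings.append(f"undeclared data category: {category}")
    for vendor in sorted(set(invocation["sub_processors_invoked"])
                         - set(entry["sub_processors"])):
        findings.append(f"undeclared sub-processor: {vendor}")
    kept_days = (_parse(invocation["retention_until"])
                 - _parse(invocation["ts"])).days
    if kept_days > entry["retention_days"]:
        findings.append(
            f"log kept {kept_days} days, register says {entry['retention_days']}"
        )
    return findings
```

Each returned string is a candidate remediation ticket. Run it over three randomly drawn invocations; if all three return findings, treat the register as out of date.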

When we built the invoice-chaser for a Rotterdam wholesaler last spring, the thing we kept hitting was sub-processor drift: the LLM provider added a fine-tuning endpoint, the vector DB rolled out a new region, and the register was three versions behind by week six. We solved it with a weekly diff-job that compares live config to the register and opens a ticket on mismatch. The same pattern sits underneath most of our AI agents work now.
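The core of such a diff-job is a comparison of two dictionaries. This is an illustrative sketch, not the job we run: the config shape (vendor name mapped to a dict of `version`, `region`, and similar keys) is assumed, and opening the ticket is left to whatever tracker is in use.

```python
def diff_register(live: dict, register: dict) -> list:
    """Compare live vendor config to the register; return ticket texts.

    Both arguments map a vendor name to a dict of declared properties,
    for example {"version": "1.4.2", "region": "eu-west-1"}.
    """
    tickets = []
    for vendor in sorted(live):
        declared = register.get(vendor)
        if declared is None:
            tickets.append(f"{vendor}: in production but not in register")
            continue
        for key in sorted(live[vendor]):
            if declared.get(key) != live[vendor][key]:
                tickets.append(
                    f"{vendor}: {key} is {live[vendor][key]!r} live, "
                    f"{declared.get(key)!r} in register"
                )
    for vendor in sorted(set(register) - set(live)):
        tickets.append(f"{vendor}: in register but no longer in production")
    return tickets
```

An empty list means the register matches production for the properties checked. The job is only as good as the `live` input, so that dict should be generated from lockfiles and deployed config, not typed by hand.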

One thing to do today, in five minutes: pick your highest-traffic agent, write its top three prompt-paden on paper, and list the data categories each one touches. If you cannot, that is your audit gap.

Key takeaway

If you can't show, in one query, which prompt-pad and which sub-processor touched a customer's data at 14:32 last Tuesday, your verwerkingsregister is fiction.

FAQ

Do we need a separate DPIA per agent, or one for the whole stack?

Per processing activity, not per product. One agent can contain multiple activities — a chat surface plus a CV-screening pad is two DPIA scopes, not one, even when they share a model.

How long should agent invocation logs be retained?

Long enough to answer an AP request and no longer. We default to 12 months for tool-call logs, 18 for the verwerkingsregister, and shorter for raw prompts if they contain personal data.

Is the LLM provider a controller or a processor?

For most B2B API usage, processor. Read the vendor's data-processing addendum and pin the version. If the vendor reserves training rights on your data, that flips the analysis and needs separate disclosure.

What triggers an AP onderzoek for an SME under €14M?

Usually a complaint from an employee or customer, a breach notification, or a sector sweep. Size does not protect you; the register-readiness test is the same as for an enterprise.

security · ai agents · operations · architecture · strategy
