RAG

Azure OpenAI RAG audit: the checklist before we quote

Before we quote a RAG retrofit on Azure, we run a 14-point audit on the tenant. Index drift, content-safety coverage, and a 04:00 failover test.

Jacob Molkenboer· Founder · A Brand New Company· 22 Apr 2026· 9 min

Open oak card catalogue drawer with cream index cards, one tagged chartreuse, brass stopwatch and wax seal on ivory paper.

It is 04:17 on a Tuesday in March. An operations lead at a Rotterdam logistics firm sends us a screenshot. Their customer-service agent, built last summer on Azure OpenAI and Azure Cognitive Search, has just answered a question about hazardous-goods routing using a procedure from 2022. The procedure was rewritten in December. The citation in the answer points at an Azure blob URL that 404s. The customer asked. The agent answered. Nobody knows what to do about either fact.

This is the situation every Dutch SME with a half-finished RAG stack lands in eventually. Not because the tech is bad. Because nobody audited the tenant before they were asked to extend it.

When a prospect comes to us asking for a RAG retrofit, meaning please make our existing knowledge-base agent actually answer correctly, we do not quote until we have run the audit below. It takes us about a day. It has saved us from three projects we should never have taken on, and it has saved clients from rebuilds they did not need.

Why we audit before quoting

The pattern we see at sub-€18M Dutch SMEs is consistent. Someone spun up Azure OpenAI in 2023 or 2024. They added Cognitive Search (now branded Azure AI Search) because the docs said to. They pointed it at SharePoint or a blob container, watched it work in a demo, shipped it, and then nobody touched the indexer for fourteen months. By the time we arrive, the index is drifting, the content-safety configuration is whatever the portal defaulted to, and the failover plan is a Notion page that says "ask Marco."

Quoting a retrofit on top of that without an audit is how you end up rebuilding the thing for free.

Our checklist has fourteen points. Three of them matter more than the others, and we will spend most of this post on those: index-refresh drift across the top 30 containers, content-safety filter coverage on the top 15 prompts, and region-failover survival from West Europe to North Europe at 04:00. The rest, embedding model versions, RBAC scope, Key Vault rotation, log retention, are mechanical. These three are where the money is.

The first time we ran this audit end-to-end was in late 2024, after a client lost a week of customer-service capacity to an agent that cited a procurement policy that had been superseded for nine months. The retrofit they had budgeted for would not have caught it. The audit did, in about forty minutes, and the rebuild they thought they needed turned into two weeks of indexer plumbing and a citation-router change.

Index-refresh drift across the top 30 containers

The first thing we pull is a snapshot of every AI Search index in the tenant and the last successful indexer run for each data source. We rank the data sources by document count and look at the top 30 containers.

For each one we compute drift: the wall-clock gap between the most recent change in the source and the most recent successful index refresh that picked it up. Not the indexer's last-run timestamp, the timestamp of the last document it actually wrote.

The query that gets us there is simple. Pull the indexer status from the REST API, join against the source container's Last-Modified headers, and write the deltas to a CSV.

curl -s -H "api-key: $SEARCH_KEY" \
  "https://$SVC.search.windows.net/indexers?api-version=2024-07-01" \
  | jq -r '.value[] | [.name, .lastResult.endTime, .lastResult.status, (.lastResult.itemsProcessed|tostring)] | @tsv'

We rerun this every fifteen minutes for a working day and watch the variance. An indexer that completes successfully but ingests zero new items run after run is the classic silent failure: the schedule shows green, the metric Azure surfaces in the portal shows green, and the index is months behind reality. The itemsProcessed field in the JSON above is the only thing that catches it cheaply.

What we look for is the long tail. The top three indexers are almost always healthy because somebody is watching them. The drift hides between rank 8 and rank 25, in the indexers nobody noticed had been silently failing on a single malformed PDF for nine months, which means everything behind that PDF in the queue never made it in.

Our threshold: any index where drift exceeds 72 hours on a daily-updated source, or where the indexer has logged the same warning more than three runs in a row, fails the audit. In a typical Dutch SME tenant with 30+ containers, we expect to fail four to seven of them. If we fail more than ten, the retrofit becomes a re-platform conversation, and we say so in writing before quoting.

Content-safety coverage on the top 15 prompts

Next we pull the top 15 user prompts by volume from Application Insights, or whatever logging the team wired up. Usually it is not Application Insights, and the first half-day is spent reconstructing logs from the agent's stdout.

For each prompt, we run it through the existing pipeline twice. Once as-is. Once with a small adversarial mutation: a swapped pronoun, a request reframed as a hypothetical, a chained instruction asking the model to summarise its own system prompt. We log which of the four Azure OpenAI content-filter categories (hate, sexual, violence, self-harm) triggered, at what severity, and whether the trigger fired on input, output, or both.

The thing nobody talks about: default content-filter configuration in Azure OpenAI is not a prompt-injection defence. Prompt Shield is a separate, opt-in layer in the Azure AI Content Safety service, and on every audit we have run in the last eight months, it was off. If your agent can be talked into revealing its system prompt with a single hypothetical, the audit fails this section before the second prompt.

For the 15 prompts: we want 100% coverage on output filtering at medium severity or above, and Prompt Shield enabled on any prompt that consumes retrieved context, which in a RAG agent is all of them.

Region-failover survival from West Europe to North Europe

This is the section most teams have never thought about, and the one that keeps coming up since the European regulatory mood shifted. Three of our Dutch clients sent us versions of the same Slack message in the last quarter: if Microsoft moves a region or the AP pushes back, what happens to our agent?

The honest answer for most of them: it goes down for somewhere between four hours and four days, and the citation trail is gone.

The test we run is concrete. We pick three knowledge-bases, the ones that touch personal data and therefore fall under AVG citation obligations, and we simulate a region failover from West Europe (Amsterdam) to North Europe (Dublin) at 04:00 CET, when the indexer cron jobs and the nightly AI Search merges are both running.

What we are actually checking:

Is the AI Search index replicated, paired, or single-region? Default is single-region. Replication is something you set up, not something Azure does for you.
Are the blob containers that back the indexers in a paired storage account (RA-GRS or GZRS), or are they LRS in a single zone?
Does the agent's citation chain, the URLs it embeds in its answers, point at region-specific resource names? If yes, every cached citation in your conversation history will 404 after failover.
Does the Azure OpenAI deployment have a sibling deployment in the secondary region with the same model version and the same fine-tuning state? Model availability across regions is not symmetric.

Of the three knowledge-bases we pick, the audit passes if at least one would survive a failover with its citation trail intact. In the last twelve months we have audited eleven Dutch SME tenants. Two passed this section. Two.

The AVG citation-trail problem at 04:00

The AVG, the Dutch implementation of the GDPR, does not say the words "RAG citation" anywhere. What it does say, via the Autoriteit Persoonsgegevens's guidance on automated decision-making, is that a data subject has the right to a meaningful explanation of how a decision involving their personal data was reached. For an agent that retrieves and answers, that means the citation trail, which document, which version, which paragraph, has to be reproducible at the moment of explanation, not at the moment of the answer.

This is the failover test's real teeth. If your citation points at https://stwesteurope.blob.core.windows.net/contracts/2024-11-rev3.pdf and the failover renames the storage account, the document is still in Dublin, but the URL in the conversation log is dead. The data subject asks why the agent suggested a particular rate. You cannot show them. That is an AVG problem before it is a Microsoft problem.

Our recommendation in every audit: citations should resolve through an application-owned router, not a direct blob URL. A simple /cite/{doc_id}/{version_hash} route that the agent emits, and that your application resolves to whichever region currently holds the document. Five days of work. Saves the failover conversation forever.

Takeaway

If your RAG agent's citations are direct Azure blob URLs, your AVG explainability story dies the first time Microsoft fails a region. Route citations through your own app.

What the audit costs, and what it changes in the quote

One engineer-day on our side. We deliver a 6 to 8 page memo with the drift table, the content-safety matrix, the failover scorecard, and a list of the eleven other checks we ran without writing about them here. Then we quote the retrofit, or we tell the client the honest version, which is sometimes "this needs to be rebuilt, here is why, and you should get a second opinion before believing us."

The pattern: about a third of audits convert into a retrofit. About a quarter convert into a smaller, narrower piece of work (fix the indexers, leave everything else). About a fifth convert into a re-platform. The rest go back into the drawer until the client has an incident that proves the audit was right.

The five-minute version you can run yourself

You do not need us for the first cut. Open the Azure portal. Go to your AI Search service, pick the indexers blade, and sort by Last run. Anything older than seven days that should be running daily is a finding. Then open Azure OpenAI Studio, go to Content filters, and check whether the deployment you are using has anything other than the default Medium filter on hate, sexual, violence and self-harm, and whether Prompt Shield is on at all.

If both of those are clean, you are in the top quartile of Dutch SME tenants we see. If either is not clean, you have an afternoon's work ahead of you, and a conversation with your DPO before that.

When we built the knowledge-base agent for a Dutch industrial wholesaler last autumn, the thing we ran into was exactly this citation-routing problem: their conversation logs had thousands of dead blob URLs from a storage-account migration nobody had connected to the agent. We solved it by emitting our own citation IDs and resolving them at render time, then backfilling the history with a one-shot script. If you are looking at a similar mess, the AI agents work we do at ABN starts with the audit above and ends with a system you can hand to your auditor without flinching.

If you do nothing else this week, pull the indexer-status JSON from your search service and sort by last successful run. Five minutes. The first thing you find will tell you whether to keep reading or start panicking.

Key takeaway

Before retrofitting an Azure RAG agent, audit index drift, content-safety coverage and citation survival under a West Europe to North Europe failover.

FAQ

How long does the audit take?

One engineer-day on our side, plus roughly half a day of access provisioning on the client's side. The deliverable is a 6 to 8 page memo with the drift table, content-safety matrix and failover scorecard.

Do we need to give you production credentials?

No. Read-only Reader role on the resource group containing AI Search and Azure OpenAI, plus log read on Application Insights or the equivalent. We do not need write access to run the audit.

What happens if our tenant fails the failover section?

Failing the failover section is the norm, not the exception. Two of eleven tenants we audited last year passed. The fix is usually a citation router and paired storage, not a full re-platform.

Is the default Azure OpenAI content filter enough?

It catches the four classic categories at medium severity. It does not stop prompt injection. Prompt Shield is a separate opt-in layer in Azure AI Content Safety and is off by default on every tenant we have audited.

Does AVG actually require a citation trail?

Not in those words. It requires a meaningful explanation of automated decisions involving personal data. For a RAG agent, the citation trail is the only practical way to provide that explanation after the fact.

ragknowledge baseai agentsarchitecturesecurityoperations

Building something?

Start a project