

Agentic reliability: twelve gaps that hit production first

It's Monday morning at a Dutch logistics firm and the new invoice-chase agent has sent 47 duplicate reminders over the weekend. Same gap as every rollout — and eleven more behind it.

Jacob Molkenboer · Founder · A Brand New Company · 22 Jun 2026 · 10 min

It's Monday morning at a thirty-person logistics firm in Eindhoven. The operations manager opens her inbox and finds 47 reminders the new invoice-chasing agent has sent to the same client over the weekend — every six minutes, polite as ever. The client has replied twice asking her to stop.

The agent didn't malfunction. It did exactly what it was told. A tool call to the email service timed out at thirty seconds, the orchestrator marked the step as failed, the retry policy fired, and on the next loop the model decided the invoice still hadn't been chased. The send actually went through every time. There was no idempotency key.

We see this pattern at almost every Dutch SME that puts its first agentic system in production. The bugs are not exotic. They are the same twelve gaps in roughly the same order. This is the field guide we hand new clients on day one.

How to read this list

The twelve gaps fall into three tiers, ordered by what it costs your team to close them.

  • Tier 1 — an ops manager and one engineer close these with a single retry policy and an afternoon of testing.
  • Tier 2 — needs structural plumbing: idempotency keys, dead-letter queues, schema validation. One good weekend.
  • Tier 3 — requires a full tool-call replay log. You will need this anyway the moment a customer or auditor asks "what did the AI actually do on 14 March at 03:17?". If you are heading for SOC 2 or ISO 27001, start here.

The order matters more than the count. A team that closes Tier 1 and Tier 3 while skipping Tier 2 looks fine for a week and then ships a 47-duplicate-email morning. A team that closes Tier 1 and Tier 2 ships happily until the first audit request lands. The tiers compound; skipping one breaks the next.

Tier 1 — the five gaps one retry policy closes

These are the boring ones. They are also responsible for most of the late-evening calls we get from clients in the first month of production.

1. Transient 5xx from the model provider

Anthropic, OpenAI, and Google all return 503s under load. This is not a bug in your code. It will happen on a Tuesday afternoon at 14:30 CET when half the European tech industry is also calling the same endpoint. Retry with exponential backoff and jitter. Cap at four attempts, then fail loudly. The provider docs are explicit about how to back off — follow them.
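As a minimal sketch of "exponential backoff and jitter", assuming a base delay of 500 ms and a ceiling of 8 seconds (both numbers are illustrative, not provider recommendations), a `backoff` helper could look like this. The `random` parameter is only there so the function can be tested deterministically.

```typescript
const MAX_ATTEMPTS = 4;

// Delay in milliseconds before retry number `attempt` (zero-based).
// baseMs and capMs are assumed defaults; tune them to your provider's guidance.
function backoff(
  attempt: number,
  baseMs = 500,
  capMs = 8_000,
  random: () => number = Math.random,
): number {
  // Exponential ceiling: base * 2^attempt, clamped to capMs.
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  // Full jitter: pick uniformly in [0, ceiling) so clients do not retry in lockstep.
  return Math.floor(random() * ceiling);
}

// The waits between four attempts: three gaps, then fail loudly.
function retrySchedule(random: () => number = Math.random): number[] {
  return Array.from({ length: MAX_ATTEMPTS - 1 }, (_, i) =>
    backoff(i, 500, 8_000, random),
  );
}
```

Full jitter is one of several jitter strategies; it trades a predictable wait for the best spread across clients that failed at the same moment.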

2. Tool-call timeouts

Your CRM's REST endpoint usually answers in 200 ms. Once a week it takes nine seconds. If your agent harness has a five-second timeout, the model concludes the call failed and tries something creative. Set tool-call timeouts to the 99th-percentile latency of the underlying API, plus a margin. Then retry. Then fail.
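A sketch of how that timeout budget might be derived and enforced, assuming you already collect per-tool latency samples. The names `percentile`, `timeoutFor`, `withTimeout` and the 1.5 margin are ours, chosen for illustration.

```typescript
// Nearest-rank percentile over observed latencies in milliseconds.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no latency samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

// Timeout budget: 99th-percentile latency times an assumed safety margin.
function timeoutFor(samples: number[], margin = 1.5): number {
  return Math.ceil(percentile(samples, 99) * margin);
}

class TimeoutError extends Error {
  readonly code = "TIMEOUT";
}

// Rejects if `work` has not settled within `ms`.
// This stops waiting; it does not cancel the underlying request,
// which may still complete on the remote side.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new TimeoutError(`timed out after ${ms} ms`)),
      ms,
    );
  });
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}
```

The comment on `withTimeout` matters: a timed-out call may still have succeeded remotely, which is why a timeout on a side-effecting tool is not proof that nothing happened.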

3. Rate-limit 429s

The provider tells you exactly how long to wait in the Retry-After header. Honour it. We have lost count of how many orchestrators we have seen that swallow this header and retry immediately, doubling the problem on the next pass.
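The header comes in two forms, a number of seconds or an HTTP date, and a parser has to handle both. A minimal sketch, assuming the caller has already pulled the raw header value off the response (the signature here is our own choice for illustration):

```typescript
// Turns a Retry-After header value into milliseconds to wait.
// Returns undefined when the header is missing or unparseable,
// so the caller can fall back to its own backoff.
function parseRetryAfter(
  header: string | null | undefined,
  now: number = Date.now(),
): number | undefined {
  if (!header) return undefined;
  const value = header.trim();
  // Form 1: delay in whole seconds, e.g. "120".
  if (/^\d+$/.test(value)) return Number(value) * 1000;
  // Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const at = Date.parse(value);
  if (Number.isNaN(at)) return undefined;
  // A date in the past means "retry now", not a negative sleep.
  return Math.max(0, at - now);
}
```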

4. Auth token expiry mid-loop

An agent loop that runs for ninety seconds can outlast a short-lived OAuth token. The token was valid when the loop started and had expired by the third tool call. Refresh on 401, retry the call once, and only then fail. Most token middleware libraries already do this; make sure yours is one of them, and test it by manually expiring a token in staging.

5. JSON-schema parse failures

The model returns a tool call where the amount field is the string "€1.250,00" instead of the number 1250. Validate against your schema, and on failure return the validation error to the model with a clear instruction. It will self-correct on the second pass nine times out of ten. The tenth time, log the raw output and route to a human; do not let the loop keep guessing.
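To make the "return the error with a clear instruction" step concrete, here is a dependency-free sketch. The `ReminderArgs` shape, the message wording and the one-correction limit are assumptions for illustration; a schema library would replace the hand-rolled checks.

```typescript
// Hypothetical argument shape for an invoice-reminder tool.
interface ReminderArgs {
  invoiceNumber: string;
  amount: number;
}

type Validation =
  | { ok: true; value: ReminderArgs }
  | { ok: false; message: string };

function validateReminderArgs(args: unknown): Validation {
  if (typeof args !== "object" || args === null) {
    return { ok: false, message: "arguments must be a JSON object" };
  }
  const { invoiceNumber, amount } = args as Record<string, unknown>;
  if (typeof invoiceNumber !== "string" || invoiceNumber.length === 0) {
    return { ok: false, message: "invoiceNumber must be a non-empty string" };
  }
  if (typeof amount !== "number" || !Number.isFinite(amount)) {
    // The message is written for the model: say what was wrong and what to send.
    return {
      ok: false,
      message:
        `amount must be a plain JSON number such as 1250, got ${JSON.stringify(amount)}. ` +
        "Resend the call with a number and no currency symbol.",
    };
  }
  return { ok: true, value: { invoiceNumber, amount } };
}

// One corrective pass, then a human. The loop never guesses a third time.
function nextStep(
  result: Validation,
  correctionsSoFar: number,
): "execute" | "return_error_to_model" | "route_to_human" {
  if (result.ok) return "execute";
  return correctionsSoFar < 1 ? "return_error_to_model" : "route_to_human";
}
```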

A reasonable retry policy that covers all five looks roughly like this:

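import { ZodError } from "zod";

// Assumed to exist elsewhere: ToolSchemas, tools, ToolResult, withTimeout,
// p99, sleep, backoff, parseRetryAfter, refreshToken.
const MAX_ATTEMPTS = 4;

async function callTool(name: string, args: unknown, attempt = 0, refreshed = false): Promise<ToolResult> {
  try {
    const validated = ToolSchemas[name].parse(args);
    return await withTimeout(tools[name](validated), p99(name) * 1.5);
  } catch (err) {
    // Schema failures go back to the model; retrying the same arguments cannot help.
    if (err instanceof ZodError) {
      return { kind: "validation_error", message: err.message };
    }
    const { status, code } = (err ?? {}) as { status?: number; code?: string };
    // attempt is zero-based, so this allows four calls in total.
    if (attempt >= MAX_ATTEMPTS - 1) throw err;
    if (status === 429) {
      await sleep(parseRetryAfter(err) ?? backoff(attempt));
      return callTool(name, args, attempt + 1, refreshed);
    }
    if ((status !== undefined && status >= 500) || code === "TIMEOUT") {
      await sleep(backoff(attempt));
      return callTool(name, args, attempt + 1, refreshed);
    }
    // Refresh on 401 and retry once; a second 401 is a real auth failure.
    if (status === 401 && !refreshed) {
      await refreshToken();
      return callTool(name, args, attempt + 1, true);
    }
    throw err;
  }
}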

About thirty lines, and it closes the first five gaps. If your stack does not have something like this around every tool call, write it before you do anything else.

Tier 2 — the four gaps a retry policy quietly makes worse

This is where most teams get hurt. A retry policy without the next layer of plumbing turns a one-off blip into a multiplied disaster. The 47-duplicate-reminder morning is one of these.

6. Duplicate side-effects

Every action that touches the outside world (send email, create invoice, post to Slack, call a webhook) needs an idempotency key. Generate it deterministically from the action: hash the recipient, the invoice number, the date. The underlying API may already de-duplicate on a key you pass; Stripe and Mollie both accept one, but check whether your email provider does. If it doesn't, write a thin ledger of your own.
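A minimal sketch of both halves, the deterministic key and the thin ledger, assuming a Node runtime for `node:crypto`. The `ReminderAction` shape and `SendLedger` class are hypothetical, and the in-memory map stands in for a durable table with a unique constraint on the key.

```typescript
import { createHash } from "node:crypto";

interface ReminderAction {
  kind: "invoice_reminder";
  recipient: string;
  invoiceNumber: string;
  date: string; // ISO calendar date, e.g. "2026-06-22"
}

// Same action in, same key out. Fields are normalised and joined in a fixed order.
function idempotencyKey(action: ReminderAction): string {
  const canonical = [
    action.kind,
    action.recipient.trim().toLowerCase(),
    action.invoiceNumber.trim(),
    action.date,
  ].join("\n");
  return createHash("sha256").update(canonical).digest("hex");
}

class SendLedger {
  // Stores the promise, not the result, so concurrent retries share one send.
  private readonly sends = new Map<string, Promise<string>>();

  sendOnce(
    action: ReminderAction,
    send: (key: string) => Promise<string>,
  ): Promise<string> {
    const key = idempotencyKey(action);
    const existing = this.sends.get(key);
    if (existing) return existing;
    // The key is passed along so it can also be forwarded to a provider that supports one.
    const attempt = send(key);
    this.sends.set(key, attempt);
    // Forget sends we observed failing so a later retry may run.
    // A send that succeeded remotely but timed out locally still needs
    // provider-side de-duplication or reconciliation.
    attempt.catch(() => this.sends.delete(key));
    return attempt;
  }
}
```

Including the date in the hash means the same invoice can be chased again tomorrow but not twice today; pick the fields to match the business rule you actually want.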

Warning

If your agent can send the same email twice, it will. Not "might" — "will". The only question is whether you catch it before the customer does.

7. Stuck workflows

A tool call hangs. Your timeout catches it. But the workflow state machine never updates because the in_progress row in the database is still there. Three days later you discover the queue has 600 stuck jobs and nobody noticed. Use a watchdog process that scans for rows older than N times the expected duration and either retries them or moves them to a dead-letter queue for a human.
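The scan itself is small. A sketch of the decision logic, assuming a job table with the hypothetical columns below, with N = 3 and a two-retry limit as placeholder values:

```typescript
interface JobRow {
  id: string;
  status: "in_progress" | "done" | "failed" | "dead_letter";
  startedAt: number; // epoch milliseconds
  expectedMs: number; // expected duration for this kind of job
  retries: number;
}

interface Verdict {
  id: string;
  action: "retry" | "dead_letter";
}

// Flags in_progress rows older than n times their expected duration.
// Rows that have already been retried maxRetries times go to the dead-letter queue.
function findStuck(
  rows: JobRow[],
  now: number,
  n = 3,
  maxRetries = 2,
): Verdict[] {
  return rows
    .filter((r) => r.status === "in_progress" && now - r.startedAt > n * r.expectedMs)
    .map((r): Verdict => ({
      id: r.id,
      action: r.retries < maxRetries ? "retry" : "dead_letter",
    }));
}
```

Run it on a schedule and alert on the dead-letter count, so the 600-job backlog shows up as a number on a dashboard and not as a surprise.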

8. Tool-argument hallucination

The model fills in the recipient as jansen@klant.nl when the actual address is j.jansen@klant.nl. The schema validates. The send fails — or worse, succeeds, to the wrong person at the wrong company. Pin tool arguments to entities you have already retrieved. If the agent wants to email a customer, the tool should accept a customer ID, not an arbitrary address. Your orchestration code resolves the ID to a verified address. The model never types an email address by hand.
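In code, pinning means the tool's argument type has no address field at all. A sketch, with a hypothetical in-memory customer directory standing in for the CRM lookup:

```typescript
interface Customer {
  id: string;
  name: string;
  verifiedEmail: string;
}

// Stand-in for the CRM; the ID and record are invented for this example.
const customers = new Map<string, Customer>([
  ["cus_001", { id: "cus_001", name: "J. Jansen", verifiedEmail: "j.jansen@klant.nl" }],
]);

// What the model may supply: an ID from an earlier retrieval, never a raw address.
interface SendReminderArgs {
  customerId: string;
  invoiceNumber: string;
}

type Resolved =
  | { ok: true; to: string; invoiceNumber: string }
  | { ok: false; error: string };

// Orchestrator-side resolution: the ID must have been retrieved in this run
// and must map to a verified address.
function resolveRecipient(
  args: SendReminderArgs,
  retrievedIds: ReadonlySet<string>,
): Resolved {
  if (!retrievedIds.has(args.customerId)) {
    return { ok: false, error: `customer ${args.customerId} was not retrieved in this run` };
  }
  const customer = customers.get(args.customerId);
  if (!customer) {
    return { ok: false, error: `unknown customer ${args.customerId}` };
  }
  return { ok: true, to: customer.verifiedEmail, invoiceNumber: args.invoiceNumber };
}
```

A hallucinated ID now fails closed with an error the model can read, where a hallucinated address would have sent mail to a stranger.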

9. Two-system writes with no rollback

Your agent updates the CRM, then tries to post a Slack notification, then the Slack call fails. The CRM is now out of sync with what the rest of the team thinks happened. Use a saga pattern with compensating actions, or, easier, funnel all side effects through a single event log that downstream systems consume. The agent writes one row; everything else is a projection.
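For teams that go the saga route, the core loop is short. This is a sketch under simple assumptions: steps run in sequence, each has a compensating action, and compensation itself does not fail (a real implementation has to handle that case, typically by dead-lettering). The `Step` and `SagaResult` types are ours.

```typescript
interface Step {
  name: string;
  run: () => Promise<void>;
  compensate: () => Promise<void>;
}

interface SagaResult {
  ok: boolean;
  failedAt?: string;
  compensated: string[];
}

// Runs steps in order. On the first failure, undoes the completed steps
// in reverse order and reports where it stopped.
async function runSaga(steps: Step[]): Promise<SagaResult> {
  const done: Step[] = [];
  for (const step of steps) {
    try {
      await step.run();
      done.push(step);
    } catch {
      const compensated: string[] = [];
      for (const prior of [...done].reverse()) {
        await prior.compensate();
        compensated.push(prior.name);
      }
      return { ok: false, failedAt: step.name, compensated };
    }
  }
  return { ok: true, compensated: [] };
}
```

In the example above, the CRM update would be the first step with "revert the CRM field" as its compensation, and the Slack post the second.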

Tier 3 — the three gaps SOC 2 will surface

These cannot be patched. They have to be designed in. If you plan to sell to a regulated buyer, a bank, or a publicly listed customer in the next eighteen months, you need them from day one. Retrofitting costs roughly five times more than building them upfront, because the replay log has to be reconciled with side effects you have already shipped.

10. Reconstructing what the agent actually did

A customer calls. "Your AI cancelled my subscription on 14 March. I did not ask for that." Can you, in under fifteen minutes, produce a complete log of every model call, every tool call, every input, every output, and the human approval (or lack of one) that led to the cancellation? If not, you have no audit trail. SOC 2's CC7 control family will ask for exactly this, and "we'll dig through Sentry" will not pass.

The minimum schema we ship for the replay log: timestamp, request ID, parent span ID, model version, prompt hash, input arguments (PII-stripped), full output, latency, retry count, and the outcome of any side-effect call this row triggered. Append-only. One row per tool call. Stored on object storage with versioning and a legal-hold lock — S3 with Object Lock works; Cloudflare R2 with the equivalent setup also works. Any agent decision is reconstructable from the log alone, even after the orchestrator code has been rewritten twice.
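Written as a type, that schema could look like the sketch below. The field names are our rendering of the list above, and the in-memory `ReplayLog` class is a hypothetical stand-in for the append-only object store.

```typescript
interface ReplayRow {
  timestamp: string; // ISO 8601, UTC
  requestId: string;
  parentSpanId: string | null;
  modelVersion: string;
  promptHash: string;
  inputArguments: unknown; // PII-stripped before it gets here
  output: unknown;
  latencyMs: number;
  retryCount: number;
  sideEffectOutcome: string | null; // e.g. a provider receipt ID, or null if none
}

// Append-only by construction: there is no update or delete method.
// Rows are serialised on append, so mutating the caller's object
// afterwards cannot rewrite history.
class ReplayLog {
  private readonly lines: string[] = [];

  append(row: ReplayRow): void {
    this.lines.push(JSON.stringify(row));
  }

  forRequest(requestId: string): ReplayRow[] {
    return this.lines
      .map((line) => JSON.parse(line) as ReplayRow)
      .filter((row) => row.requestId === requestId);
  }

  // One JSON document per line, ready to write to object storage.
  toJsonl(): string {
    return this.lines.join("\n");
  }
}
```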

11. PII in prompts and responses

The agent's context window contains customer names, email addresses, sometimes payment data. Every one of those tokens is now in the model provider's logs, your replay log, your observability tool, and probably your Sentry traces. Under GDPR Article 32 you are accountable for all of them. Two practical moves: strip PII before logging (tokenize, store the mapping in a separate KMS-encrypted table), and switch on the provider's zero-retention mode where it is available.
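A sketch of the tokenize-before-logging step, limited to email addresses to keep it short. Names and payment data need their own detectors, and the `PiiVault` map here is a hypothetical stand-in for the separately encrypted mapping table.

```typescript
// Maps each distinct address to a stable placeholder token.
class PiiVault {
  private readonly byValue = new Map<string, string>();
  private readonly byToken = new Map<string, string>();

  tokenFor(value: string): string {
    const key = value.toLowerCase();
    let token = this.byValue.get(key);
    if (!token) {
      token = `<email_${this.byValue.size + 1}>`;
      this.byValue.set(key, token);
      this.byToken.set(token, value);
    }
    return token;
  }

  // Reverse lookup for authorised incident work; guard this behind access control.
  reveal(token: string): string | undefined {
    return this.byToken.get(token);
  }
}

// Deliberately simple pattern; it will miss unusual but valid addresses.
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

// Replaces every address in `text` with its token before the text is logged.
function stripEmails(text: string, vault: PiiVault): string {
  return text.replace(EMAIL, (match) => vault.tokenFor(match));
}
```

Because the same address always maps to the same token, the stripped log still shows that two rows concern the same customer without showing who that customer is.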

12. State drift after an incident

The agent ran for six hours during an outage where your CRM was returning stale reads. Half its decisions are based on wrong data. After the incident, you need to know which decisions to roll back. Without a replay log keyed to the source-system version at the time of each read, you cannot. With one, you can replay the agent against the current data, diff the outputs, and act only on the divergent rows. The same machinery is what lets you regression-test a new model version against last month's traffic: rerun the trace, look only at the cases where the action diverges, decide whether the divergence is an improvement or a regression.
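The diff step reduces to a loop over logged decisions. In this sketch `decide` stands in for rerunning the agent and `currentInput` for re-reading the source system; both are hypothetical, and a real rerun would be asynchronous with every side effect stubbed out.

```typescript
interface LoggedDecision {
  requestId: string;
  input: Record<string, unknown>; // what the agent saw at the time
  action: string; // what it did
}

interface Divergence {
  requestId: string;
  was: string;
  now: string;
}

// Replays each logged decision against current data and returns
// only the rows where the action would now be different.
function diffReplay(
  log: LoggedDecision[],
  currentInput: (requestId: string) => Record<string, unknown>,
  decide: (input: Record<string, unknown>) => string,
): Divergence[] {
  const out: Divergence[] = [];
  for (const row of log) {
    const now = decide(currentInput(row.requestId));
    if (now !== row.action) {
      out.push({ requestId: row.requestId, was: row.action, now });
    }
  }
  return out;
}
```

Swap `decide` for a new model version and leave the inputs as logged, and the same function becomes the regression test described above.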

The order to attack this in

Tier 1 in week one. Tier 2 before the second use-case ships. Tier 3 before the first regulated customer, or the first agent that can move money. We have never seen a team regret doing Tier 3 early; we have seen several regret postponing it.

Two things to put in your runbook before any of this matters. First, name a single human as the owner of every model-provider account, with payment method, billing alerts and rate-limit increases all routed to them. The day a junior engineer leaves and the provider account is in their personal Gmail, the rollout stops. Second, prefer short-lived, scoped credentials for every third-party tool the agent calls. Long-lived API keys baked into the orchestrator are how a fair share of the next decade's incident post-mortems will open: "the contractor who set up the integration in 2024 had no two-factor on the key vault."

Where this list came from

When we built the AR-chasing agent for a Rotterdam wholesaler last quarter, the thing that nearly killed the rollout was gap #6. The retry policy was solid. The idempotency key was not. We ended up moving key generation upstream of the agent itself — the orchestrator stamps a deterministic key before the model ever sees the task, so a retry from any layer hits the same key and the same row. The model can now call the send-email tool a hundred times in a single loop; the second through hundredth all return the original send's receipt, and the controller's inbox stays quiet.

If you are about to ship your first AI agent into production, draw your tool-call graph on a whiteboard this afternoon and write the failure mode next to every arrow. The twelve gaps will find you in roughly the order above. Decide today which ones you close before go-live, and which ones you accept as known risk until Tier 3 lands.

Key takeaway

A retry policy buys you a quiet first week. A replay log buys you the next two years.

FAQ

What's the smallest reliability change to make before sending an agent to production?

Wrap every tool call in a retry policy that handles 5xx, 429, 401, timeouts, and schema-validation errors. Cap at four attempts. That alone covers most first-month incidents.

Do I really need a tool-call replay log if I'm not pursuing SOC 2?

Yes. You will want it the first time a customer disputes an agent action. Audit is one buyer of replay logs; debugging, incident response, and adding new tools all use the same data.

Should the model generate idempotency keys, or should the orchestrator?

The orchestrator, upstream of the model, deterministically hashed from the action's identifying fields. Models hallucinate keys. Deterministic hashes don't. The model never sees the key.

Is zero-retention mode at the model provider enough for GDPR compliance?

It covers one hop. You still need PII stripping before your replay log, observability tools, and any error tracker. Provider zero-retention does not absolve the rest of your stack.
