Agent hallucinations: the eight-line incident report we file

Friday 16:47. An agent tells a customer their invoice is paid. It isn't. By 17:02 the eight-line incident report is filed and the weekend can start.

Jacob Molkenboer · Founder · A Brand New Company · 19 Mar 2024 · 7 min

Friday 16:47. The on-call channel pings. An agent we built for a wholesaler has told a customer their invoice is paid. It is not paid. The customer screenshots the reply and forwards it to their accounts team, who CC the wholesaler's finance lead, who asks us, calmly and in writing, what is going on.

By 17:02 the eight-line incident report is filed in a shared doc. The on-call writes nothing else until Monday. No fix, no patch, no prompt tweak. The agent is paused, the affected customer is emailed by a human, and the weekend starts on time.

That second paragraph is the part most teams skip. We didn't always do it either.

Why we write anything down at all

Hallucinations in production agents are not rare. They are a class of failure with the cadence of a flaky integration test: most weeks none, then three in a day. The reflex is to open the prompt, find the offending instruction, add a "do not say X under any circumstances" line, redeploy, and move on.

That reflex is what kept our agent reliability flat for the first six months we ran agents for clients. We were patching symptoms inside a Friday evening Slack thread that nobody read again. By the time the same failure mode showed up two clients later, no one remembered the first one.

So we borrowed the bones of Google's blameless postmortem format, stripped it down to the smallest thing we could stand to fill in, and forced ourselves to write it every time, before doing anything else.

The eight lines

It is a markdown file in a shared repo. One file per incident, named YYYY-MM-DD-agent-shortname.md. Eight headings, one or two sentences under each. The whole thing takes under fifteen minutes to write and is meant to.

1. timestamp_utc:   2026-05-29T14:47:11Z
2. agent:           dunning-agent v3.4.1 (model: redacted)
3. trigger:         inbound email from finance@acme.example, subject "RE: invoice 8842"
4. output_verbatim: "Your invoice 8842 has been settled on 28 May."
5. ground_truth:    invoice 8842 status = OPEN, last reminder sent 27 May
6. blast_radius:    1 customer, 1 email, forwarded to 3 internal recipients at client
7. containment:     agent paused 14:51, client notified 14:55, customer apology sent 15:08
8. class:           tool-result misread (payments API returned an unrelated paid invoice)

That is the whole document. No analysis, no fix proposal, no blame, no model post-mortem. Just what happened, what was true, who saw it, and what we did in the next sixty minutes.
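A minimal sketch of how the eight lines and the file name could be produced from code, assuming the report is rendered as plain text in the layout shown above. The `IncidentReport` class, its `filename` and `render` helpers, and the `short_name` argument are illustrative names, not part of any existing tooling.

```python
from dataclasses import dataclass, fields
from datetime import datetime, timezone


@dataclass(frozen=True)
class IncidentReport:
    """The eight lines, in order. Field names mirror the report above."""
    timestamp_utc: datetime
    agent: str
    trigger: str
    output_verbatim: str
    ground_truth: str
    blast_radius: str
    containment: str
    failure_class: str  # rendered as "class", which is reserved in Python

    def filename(self, short_name: str) -> str:
        # YYYY-MM-DD-<short name>.md, dated by the incident timestamp.
        return f"{self.timestamp_utc:%Y-%m-%d}-{short_name}.md"

    def render(self) -> str:
        lines = []
        for number, spec in enumerate(fields(self), start=1):
            label = "class" if spec.name == "failure_class" else spec.name
            value = getattr(self, spec.name)
            if isinstance(value, datetime):
                value = value.strftime("%Y-%m-%dT%H:%M:%SZ")
            elif spec.name == "output_verbatim":
                value = f'"{value}"'  # verbatim and quoted, never paraphrased
            key = f"{label}:"
            lines.append(f"{number}. {key:<17}{value}")
        return "\n".join(lines)


report = IncidentReport(
    timestamp_utc=datetime(2026, 5, 29, 14, 47, 11, tzinfo=timezone.utc),
    agent="dunning-agent v3.4.1 (model: redacted)",
    trigger='inbound email from finance@acme.example, subject "RE: invoice 8842"',
    output_verbatim="Your invoice 8842 has been settled on 28 May.",
    ground_truth="invoice 8842 status = OPEN, last reminder sent 27 May",
    blast_radius="1 customer, 1 email, forwarded to 3 internal recipients at client",
    containment="agent paused 14:51, client notified 14:55, customer apology sent 15:08",
    failure_class="tool-result misread (payments API returned an unrelated paid invoice)",
)
print(report.filename("dunning-agent"))
print(report.render())
```

Every field is required, so a report with a missing line fails at construction time rather than at Monday's review.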

Line 4 is the only line that is hard. People want to paraphrase. They want to soften it, summarise it, add context. We don't let them. The output goes in verbatim, in quotes, with the surrounding sentence if it changes the meaning. If we can't reproduce the exact string we say so on line 4 ("output not retained, paraphrased from customer screenshot") and that becomes its own backlog item: fix logging.

Line 8 is the one that quietly does the most work. We have a fixed vocabulary of failure classes:

tool-result misread: the tool returned correct data, the agent misinterpreted it.
tool-result trusted blindly: the tool returned wrong or stale data and the agent passed it through.
prompt under-constrained: the agent did something the prompt didn't forbid but should have.
prompt over-constrained: the agent refused or deflected something it should have handled.
retrieval miss: the right document existed in the knowledge base but didn't make it into context.
retrieval false positive: the wrong document came back and was treated as authoritative.
state confusion: the agent mixed up two conversations, two customers, or two threads.
pure fabrication: no tool involved, the model invented a fact end-to-end.

Eight classes, deliberately. They map onto eight different fixes, and they fight the temptation to file every hallucination as "the model lied." The model rarely lies in the sense people mean. Almost every incident, looked at closely, is one of the first seven.
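One way to keep the vocabulary fixed is to make free text fail validation. The sketch below assumes line 8 starts with the class name and may carry a parenthesised note after it; `FailureClass` and `parse_line_8` are hypothetical names used for illustration.

```python
from enum import Enum


class FailureClass(Enum):
    TOOL_RESULT_MISREAD = "tool-result misread"
    TOOL_RESULT_TRUSTED_BLINDLY = "tool-result trusted blindly"
    PROMPT_UNDER_CONSTRAINED = "prompt under-constrained"
    PROMPT_OVER_CONSTRAINED = "prompt over-constrained"
    RETRIEVAL_MISS = "retrieval miss"
    RETRIEVAL_FALSE_POSITIVE = "retrieval false positive"
    STATE_CONFUSION = "state confusion"
    PURE_FABRICATION = "pure fabrication"


def parse_line_8(line: str) -> FailureClass:
    """Return the class named at the start of line 8, or raise ValueError.

    Anything after an opening parenthesis is treated as a free-text note,
    as in "tool-result misread (payments API returned ...)".
    """
    label = line.split("(", 1)[0].strip().lower()
    try:
        return FailureClass(label)
    except ValueError:
        allowed = ", ".join(c.value for c in FailureClass)
        raise ValueError(
            f"{label!r} is not a known class; use one of: {allowed}"
        ) from None
```

With this in place, a report that says "the model lied" on line 8 is rejected instead of quietly becoming a ninth class.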

Monday

On Monday morning we open the week with thirty minutes on the incident log. Whoever was on-call walks the room through each new report. Three questions, in order:

Is this a new class for this agent, or have we seen it before?
If we have, is the rate accelerating?
What is the smallest change that would have caught this in staging?

That third question is the load-bearing one. It is also where most teams go wrong, because the honest answer is almost never "a better prompt." A better prompt is the cheapest fix and the least durable one. After four years of running these agents, we see the durable fixes cluster into four buckets, and the bucket is decided by line 8.
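The first two Monday questions can be answered mechanically from the incident log. The sketch below is illustrative only: `Incident`, `monday_review`, and the 28-day window are assumptions, and "accelerating" is defined here as more incidents of the same class for the same agent in the latest window than in the window before it.

```python
from datetime import date, timedelta
from typing import NamedTuple, Optional


class Incident(NamedTuple):
    day: date
    agent: str
    failure_class: str


def monday_review(log: list[Incident], new: Incident, window_days: int = 28) -> dict:
    """Answer "seen before?" and "accelerating?" for one new incident.

    `log` holds earlier incidents only; `new` is the report under review.
    """
    same = [
        i for i in log
        if i.agent == new.agent and i.failure_class == new.failure_class
    ]
    accelerating: Optional[bool] = None
    if same:
        cutoff = new.day - timedelta(days=window_days)
        older_cutoff = cutoff - timedelta(days=window_days)
        # The new incident counts toward the latest window.
        recent = 1 + sum(1 for i in same if cutoff < i.day <= new.day)
        previous = sum(1 for i in same if older_cutoff < i.day <= cutoff)
        accelerating = recent > previous
    return {"new_class_for_agent": not same, "accelerating": accelerating}
```

The third question, the smallest change that would have caught the failure in staging, stays a human judgment; the code only makes sure the first two are answered from the log rather than from memory.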

Takeaway

Classify the failure before you fix it. The class on line 8 picks the bucket on Monday. Skip the classification and you will reach for a prompt edit every time, because prompt edits feel like progress.

Bucket one: evals

For pure fabrication and prompt under-constrained, we add an eval. A frozen test case in our eval suite that exercises the exact path that failed, with the exact inputs, against the current agent version. If the eval passes today but fails next week, we want to know the day it starts failing. If we don't add the eval, we are telling ourselves the model is fine, which is a wish, not a measurement.
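A frozen eval case can be as small as the exact input, the tool output the agent saw, and the claims the reply must not make. The sketch below assumes the agent can be called as a function of its input and tool results; `EvalCase`, `run_case`, and the forbidden phrases are hypothetical, with the values borrowed from the dunning report for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One frozen regression case taken from an incident report."""
    incident_file: str        # the report this case came from
    agent_input: str          # line 3, the exact trigger
    tool_results: dict        # what the tools returned at the time
    forbidden_phrases: tuple  # claims the reply must not make, given line 5


def run_case(agent: Callable[[str, dict], str], case: EvalCase) -> bool:
    """Return True if the agent's reply avoids every forbidden phrase."""
    reply = agent(case.agent_input, case.tool_results).lower()
    return not any(p.lower() in reply for p in case.forbidden_phrases)


case = EvalCase(
    incident_file="2026-05-29-dunning-agent.md",
    agent_input='inbound email, subject "RE: invoice 8842"',
    tool_results={"invoice_id": "8842", "status": "OPEN"},
    forbidden_phrases=("has been settled", "has been paid"),
)
```

Running the same case against every new agent version is what turns "the model is fine" from a wish into a measurement.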

Bucket two: guardrails

For tool-result misread, tool-result trusted blindly, and state confusion, we don't trust the model to fix itself. We add a deterministic check outside the agent loop. For the dunning incident above, the guardrail is roughly fifteen lines of Python: before any outbound message that contains "settled," "paid," "received," or "cleared," re-query the payments API for that specific invoice ID and assert the status. If the assertion fails, the message is held and a human is paged. The agent never gets a second chance to lie about a paid invoice.

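import re

# Whole-word match, so "unpaid" is not read as a payment claim.
PAID_WORDS = re.compile(r"\b(settled|paid|received|cleared)\b", re.IGNORECASE)

class HoldForReview(Exception):
    """Raised when an outbound message must be held for human review."""

def guardrail_payment_claim(message: str, invoice_id: str) -> None:
    # No payment claim in the message: nothing to verify.
    if not PAID_WORDS.search(message):
        return
    # Re-query the source of truth (payments_api is the project's API client).
    status = payments_api.get_invoice(invoice_id).status
    if status != "PAID":
        raise HoldForReview(
            f"agent asserted paid, API says {status} for {invoice_id}"
        )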

Bucket three: retrieval

For retrieval miss and retrieval false positive, we look at how the document was indexed, not at how the answer was generated from it. The failure is almost always in chunking, embedding, or a metadata filter that quietly excluded the right document. Fixing the prompt at this point is rearranging deckchairs. We re-run the retrieval in isolation, with the original query, and watch which documents come back and in what order. The fix lives in the index, not in the prompt.
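Re-running retrieval in isolation can be sketched as a function that reports where the expected document ended up. This is a toy under stated assumptions: documents are dicts with `id`, `text`, and `meta` keys, and `overlap_score` is a stand-in for a real embedding similarity. `diagnose_retrieval` and its verdict strings are hypothetical.

```python
from typing import Callable


def overlap_score(query: str, text: str) -> float:
    """Toy relevance score: fraction of query tokens found in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def diagnose_retrieval(
    query: str,
    docs: list[dict],
    expected_id: str,
    metadata_filter: Callable[[dict], bool],
    top_k: int = 3,
) -> dict:
    """Re-run retrieval alone and say where the expected document went."""
    candidates = [d for d in docs if metadata_filter(d["meta"])]
    ranked = sorted(
        candidates,
        key=lambda d: overlap_score(query, d["text"]),
        reverse=True,
    )
    ranked_ids = [d["id"] for d in ranked]
    if expected_id not in {d["id"] for d in docs}:
        verdict = "not indexed"
    elif expected_id not in ranked_ids:
        verdict = "excluded by metadata filter"
    elif ranked_ids.index(expected_id) >= top_k:
        verdict = "ranked below top_k"
    else:
        verdict = "retrieved"
    return {"verdict": verdict, "ranked_ids": ranked_ids}
```

Each verdict points at a different part of the index: ingestion, filters, or ranking. None of them points at the prompt.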

Bucket four: the prompt, finally

Only prompt over-constrained gets fixed in the prompt, and even there we are careful. An over-constrained prompt is usually the scar tissue from an earlier incident that was fixed by adding a "never do X" line. Removing those lines without re-running the original eval is how you regress to the failure that caused them. So bucket four always loops back to bucket one: change the prompt, add the eval that proves the old failure stays fixed.
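The class-to-bucket routing described in the four sections above can be written down as a lookup table. This is a sketch: `BUCKETS` and `fixes_for` are hypothetical names, and the placement of tool-result misread is an assumption based on how the dunning incident was handled.

```python
BUCKETS = {
    "pure fabrication": ("eval",),
    "prompt under-constrained": ("eval",),
    "tool-result trusted blindly": ("guardrail",),
    "state confusion": ("guardrail",),
    "retrieval miss": ("retrieval",),
    "retrieval false positive": ("retrieval",),
    # Bucket four always loops back to bucket one.
    "prompt over-constrained": ("prompt", "eval"),
    # Assumption: placed with guardrails because the dunning incident,
    # classed as tool-result misread, was fixed with one.
    "tool-result misread": ("guardrail",),
}


def fixes_for(failure_class: str) -> tuple:
    """Return the fix buckets for a class, or refuse if it is unclassified."""
    try:
        return BUCKETS[failure_class]
    except KeyError:
        raise ValueError(
            f"{failure_class!r} has no bucket; classify the incident first"
        ) from None
```

Note that "prompt" never appears on its own: any route that edits the prompt also adds the eval that proves the old failure stays fixed.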

What we don't do

We don't ship a fix on Friday. We don't retrain or swap models based on a single incident. We don't argue with the customer about whether the output "really counts" as a hallucination. We don't write a long Slack post explaining how it happened. The eight lines are the explanation.

We also don't pretend the model is going to stop doing this. The AI Incident Database has been collecting public AI failures for years and the curve does not bend down on its own. What bends is the gap between the incident happening and the team noticing, and the gap between noticing and a durable fix landing. Both of those are process problems, not model problems.

If your agent's incident log is a Slack channel, you don't have an incident log. You have a venting channel. Move it to a file in a repo this week.

The Monday after

By the Monday after the dunning incident, the guardrail was in, the eval was green, and line 8 had a new entry in the team's running tally: tool-result misread, now five incidents across four agents. That tally is what prompted us to add a generic "verify status before asserting status" wrapper to every agent that talks to a transactional API. One incident is a story; five is a pattern; a pattern earns a platform fix.

When we built the dunning-agent for that wholesaler, we found that the payments API returned the most recently paid invoice when queried with a malformed ID, instead of returning an error. We solved it in the guardrail layer, not in the prompt, and the same wrapper now sits in front of every one of our AI agents that touches money.
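A generic "verify status before asserting status" wrapper might look like the sketch below. It is not the wrapper described here, only an illustration under assumptions: `Record`, `verify_before_asserting`, and the `fetch` callable are hypothetical, and the ID comparison is one way to catch an API that answers a malformed ID with some other record.

```python
import re
from dataclasses import dataclass
from typing import Callable


class HoldForReview(Exception):
    """Raised when an outbound message must be held for human review."""


@dataclass(frozen=True)
class Record:
    record_id: str
    status: str


def verify_before_asserting(
    message: str,
    record_id: str,
    claim_pattern: re.Pattern,
    expected_status: str,
    fetch: Callable[[str], Record],
) -> str:
    """Return the message unchanged if its status claim checks out, else raise."""
    if not claim_pattern.search(message):
        return message  # no status claim, nothing to verify
    record = fetch(record_id)
    # Guard against an API that answers a bad ID with a different record.
    if record.record_id != record_id:
        raise HoldForReview(
            f"asked for {record_id}, API returned {record.record_id}"
        )
    if record.status != expected_status:
        raise HoldForReview(
            f"message asserts {expected_status}, "
            f"API says {record.status} for {record_id}"
        )
    return message
```

Passing the claim pattern, the expected status, and the fetch function as arguments is what lets one wrapper serve invoices, orders, or refunds without touching any agent's prompt.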

If you run a production agent and don't have an incident format yet, copy the eight lines above into a file called incidents/README.md in your agent's repo this afternoon. Next Friday at 16:47, you will be glad it is there.

Key takeaway

Classify the hallucination before you patch it. Line 8 picks the bucket, and the bucket decides if the fix is an eval, a guardrail, retrieval, or a prompt.

FAQ

How long should writing an incident report take?

Under fifteen minutes. If it takes longer, the format is too detailed. The point is to capture what happened before the urge to fix it overwhelms the urge to record it.

Why pause the agent on Friday instead of fixing it?

A Friday-night fix by a tired engineer is how regressions get shipped. Pausing the agent costs less than another bad message, and the fix is better made on Monday, when the room is fresh.

What goes on line 8 if we genuinely don't know the class?

Mark it 'unknown' and treat figuring out the class as the Monday task. An unknown class usually means your logs aren't capturing enough to reconstruct the call.

Do we share the incident report with the affected customer?

No. The eight-line report is internal. Customers get a separate, human-written apology with whatever specifics they need. The report exists to make the next fix better.
