
AI agent audits: nine Stripe refunds taught us a checklist

An agent we shipped re-ran the same Stripe refund nine times against a real customer balance after a checkpoint crash. Here is the audit we run now.

Jacob Molkenboer · Founder · A Brand New Company · 12 Jun 2026 · 8 min


The Stripe dashboard showed nine refunds against the same charge, all within two minutes, all of them issued by an agent we had shipped. The customer had been charged 47 EUR. By the time the on-call engineer hit pause, we had returned her 423 EUR.

That is the kind of bug you only get to file once. After we wrote the customer an honest email and reversed the duplicates, we sat down and wrote a checklist. Every agent we have shipped since then walks through it before it gets a connection string for a client's production Postgres.

This post is that checklist.

The actual failure

The agent was a refund handler for a small e-commerce client. A human customer-service operator approved a refund in a back-office tool. The agent, built on LangGraph with its SQLite checkpointer, walked through three steps: load the order, decide the refund amount, call Stripe.

The third step crashed. Stripe's API returned a 503 during a routine incident. Our retry policy, written in a hurry six months earlier, treated 503 as transient and retried the entire node. The LangGraph checkpointer, by design, persists state on successful step completion. A crash mid-step means the resume restarts that step from the top. The step had no idempotency key on the Stripe call. So every retry, plus every checkpoint resume after the process was restarted, issued a new refund.

The reading we want you to take from this is not "LangGraph is broken." LangGraph behaves exactly as documented. The reading is: a checkpointed agent that does external work needs the same care a payment processor itself needs. That is true of Burr, Temporal, and any other framework whose persistence model is at-least-once. Which is almost all of them.
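To make the replay concrete, here is a minimal simulation in plain Python. It is not LangGraph code: `FlakyGateway`, `run_with_checkpoints`, and the three lambda steps are hypothetical stand-ins. The only behavior carried over from the description above is that progress is saved after a step completes, and that the refund goes through even when the caller sees an error.

```python
class FlakyGateway:
    """Hypothetical payment API: performs the refund, then fails to answer."""

    def __init__(self, failures):
        self.failures = failures   # how many calls end in an error
        self.refunds_issued = 0    # side effects that really happened

    def refund(self, amount_cents):
        self.refunds_issued += 1   # the money moves first
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("503 after the refund was processed")
        return "re_123"


def run_with_checkpoints(steps, state, checkpoint):
    """Run steps in order, saving progress only after a step succeeds."""
    for index in range(checkpoint["next_step"], len(steps)):
        steps[index](state)                   # a crash here saves nothing
        checkpoint["next_step"] = index + 1   # persisted on success only


gateway = FlakyGateway(failures=2)
state = {"amount_cents": 4700}
checkpoint = {"next_step": 0}
steps = [
    lambda s: s.update(order_loaded=True),
    lambda s: s.update(amount_decided=True),
    lambda s: s.update(refund_id=gateway.refund(s["amount_cents"])),
]

attempts = 0
while checkpoint["next_step"] < len(steps):
    attempts += 1
    try:
        run_with_checkpoints(steps, state, checkpoint)
    except ConnectionError:
        pass  # the process "restarts" and resumes from the last checkpoint

print(attempts, gateway.refunds_issued)
```

With two failures configured, the run takes three attempts and issues three refunds. The first two steps never repeat; only the step that crashed does, and it repeats in full.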

Side effects in their own table

Every external side effect should be recorded in your own database before you make the call, with a deterministic id you can look up on retry. This is the transactional outbox pattern, only inverted: we write our intent first, then we call the outside world, then we write back the result.

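import stripe  # `db` below is the application's own database helper


def issue_refund(state):
    # Deterministic id: a retry or a checkpoint resume rebuilds the same key.
    op_id = f"refund:{state['order_id']}:{state['attempt_token']}"
    with db.transaction() as tx:
        # Lock the existing row, if there is one, so concurrent resumes
        # serialize on it.
        row = tx.execute(
            "select stripe_id, status from refund_ops "
            "where op_id = %s for update",
            (op_id,),
        ).fetchone()
        if row and row["status"] == "succeeded":
            # An earlier attempt already finished: return the stored result.
            return {"stripe_id": row["stripe_id"]}
        if not row:
            # Record the intent before touching Stripe.
            tx.execute(
                "insert into refund_ops (op_id, status) "
                "values (%s, 'pending')",
                (op_id,),
            )

    # Stripe deduplicates on idempotency_key, so a repeated call returns
    # the original refund instead of creating a second one.
    refund = stripe.Refund.create(
        charge=state["charge_id"],
        amount=state["amount_cents"],
        idempotency_key=op_id,
    )

    # Write the outcome back so the next resume short-circuits above.
    with db.transaction() as tx:
        tx.execute(
            "update refund_ops set stripe_id = %s, status = %s "
            "where op_id = %s",
            (refund.id, refund.status, op_id),
        )
    return {"stripe_id": refund.id}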

Two things make this safe. The op_id is deterministic from the agent's state, so a resume produces the same key. And we pass it as Stripe's idempotency_key, which means even if our own table check loses a race, Stripe will return the original refund object instead of creating a new one. Stripe documents this contract clearly, including the window of at least 24 hours in which the guarantee holds (Stripe API: idempotent requests). For long-running agents that matters: a resume that arrives after the key has expired gets no deduplication from Stripe, so the row in refund_ops is the only guard left.

The same shape applies to email sends, webhook publishes, file writes, and DDL. Every action with a side effect gets a row in your database before the action runs.
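For side effects that are not Stripe calls, the same record-then-act shape might look like the sketch below. It uses the standard library's sqlite3 so it runs on its own; the `side_effect_ops` table, `run_once`, and `send_email` are hypothetical names, and the primary key on `op_id` is an assumption the pattern depends on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table side_effect_ops ("
    " op_id text primary key, status text not null, result text)"
)


def run_once(conn, op_id, action):
    """Record intent, perform the action, record the result.

    A repeat call with the same op_id returns the stored result instead of
    running the action again. If the process dies between the action and
    the final update, the row stays 'pending' and the action would run
    again, so the external call still needs its own idempotency key
    wherever the provider offers one.
    """
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "insert or ignore into side_effect_ops (op_id, status) "
            "values (?, 'pending')",
            (op_id,),
        )
    row = conn.execute(
        "select status, result from side_effect_ops where op_id = ?",
        (op_id,),
    ).fetchone()
    if row[0] == "succeeded":
        return row[1]

    result = action()  # the external side effect

    with conn:
        conn.execute(
            "update side_effect_ops set status = 'succeeded', result = ? "
            "where op_id = ?",
            (result, op_id),
        )
    return result


sent = []


def send_email():
    sent.append("receipt")  # stand-in for a real mail API call
    return "msg-1"


first = run_once(conn, "email:order-42:receipt", send_email)
second = run_once(conn, "email:order-42:receipt", send_email)
```

Both calls return "msg-1", and `sent` holds a single entry: the second call finds the succeeded row and never reaches `send_email`.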

Database roles, not just connection strings

The agent that issued nine refunds connected as a user with SELECT, INSERT, UPDATE, DELETE on every table in the public schema. It needed INSERT and UPDATE on two tables. We learned the hard way: an agent should connect as a role that can only do what its tools say it can do.

CREATE ROLE agent_refund_writer LOGIN PASSWORD '...';
GRANT CONNECT ON DATABASE app TO agent_refund_writer;
GRANT USAGE ON SCHEMA public TO agent_refund_writer;

GRANT SELECT ON orders, customers
  TO agent_refund_writer;
GRANT INSERT, UPDATE ON refund_ops, agent_audit_log
  TO agent_refund_writer;

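-- On PostgreSQL 14 and earlier, CREATE on schema public is granted to PUBLIC,
-- so it has to be revoked from PUBLIC; revoking it from one role is a no-op.
-- This affects every role that relied on that default.
REVOKE CREATE ON SCHEMA public FROM PUBLIC;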
ALTER ROLE agent_refund_writer SET statement_timeout = '5s';
ALTER ROLE agent_refund_writer SET lock_timeout = '2s';

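The role can read two tables, write to two tables, and cannot create objects. By default its statements are cancelled after five seconds and it will not wait more than two seconds for a lock. Those two timeouts are session defaults that a session can override with SET, so they only hold as long as no tool lets the agent run arbitrary SQL. Postgres role design (PostgreSQL: database roles) is older than any agent framework. Use it.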

For agents that are nominally "read only," we still create a dedicated role with SELECT on an allowlist of tables. "Read only" without a dedicated role is a polite suggestion. Production grants are not the place for polite suggestions.
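One way to keep grants from drifting away from what the tools need is to compare the two mechanically. The sketch below assumes the role's grants have been fetched as (table_name, privilege_type) pairs from Postgres's `information_schema.role_table_grants` view; the `REQUIRED` mapping, the `excess_grants` function, and the `payouts` table in the example are hypothetical.

```python
# What the agent's tools declare they need (mirrors the grants above).
REQUIRED = {
    "orders": {"SELECT"},
    "customers": {"SELECT"},
    "refund_ops": {"INSERT", "UPDATE"},
    "agent_audit_log": {"INSERT", "UPDATE"},
}


def excess_grants(granted_rows, required):
    """Return {table: privileges} the role holds but no tool asked for.

    granted_rows is an iterable of (table_name, privilege_type) pairs,
    for example the result of:
      select table_name, privilege_type
      from information_schema.role_table_grants
      where grantee = 'agent_refund_writer'
    """
    excess = {}
    for table, privilege in granted_rows:
        if privilege not in required.get(table, set()):
            excess.setdefault(table, set()).add(privilege)
    return excess


# Example input: an over-broad role, reduced to a few rows.
rows = [
    ("orders", "SELECT"),
    ("orders", "DELETE"),
    ("refund_ops", "INSERT"),
    ("refund_ops", "UPDATE"),
    ("payouts", "SELECT"),
]
report = excess_grants(rows, REQUIRED)
```

Here `report` flags DELETE on orders and SELECT on payouts. An empty report means the role holds nothing beyond what the tools declare.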

Tool schemas as the surface area

Every tool an agent can call is API surface, and you should treat it like API surface. The audit requires:

A JSON Schema for every tool input, with additionalProperties set to false.
A handwritten allowlist of values for any free-form string that ends up in a SQL identifier, file path, URL, or shell argument.
A maximum length for any free-form string that ends up in a prompt, a database column, or an outbound email body.

The free-form string rule is the one teams skip. An agent that can write any customer email gets phished into writing a refund-approval email to itself within a week. We have seen this. The recent thread about an agent that scanned the DN42 network until it ran its operator's account dry is a different shape of the same disease: unbounded inputs producing unbounded actions.
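The three rules can be sketched without any schema library. The validator below is a hand-rolled illustration, not a JSON Schema implementation; the field names, the reason codes, and the 500-character ceiling are all hypothetical.

```python
ALLOWED_KEYS = {"order_id", "reason_code", "note"}
REASON_CODES = {"duplicate", "damaged", "not_received"}  # hypothetical allowlist
MAX_NOTE_LENGTH = 500                                     # hypothetical ceiling


def validate_refund_input(payload):
    """Return a list of problems; an empty list means the input is accepted."""
    problems = []

    # Rule 1: reject unknown keys (the effect of additionalProperties: false).
    extra = set(payload) - ALLOWED_KEYS
    if extra:
        problems.append(f"unexpected keys: {sorted(extra)}")

    order_id = payload.get("order_id")
    is_int = isinstance(order_id, int) and not isinstance(order_id, bool)
    if not is_int or order_id <= 0:
        problems.append("order_id must be a positive integer")

    # Rule 2: strings that steer behavior come from an allowlist.
    reason = payload.get("reason_code")
    if not isinstance(reason, str) or reason not in REASON_CODES:
        problems.append("reason_code is not on the allowlist")

    # Rule 3: free-form text has a hard maximum length.
    note = payload.get("note", "")
    if not isinstance(note, str) or len(note) > MAX_NOTE_LENGTH:
        problems.append(f"note must be a string of at most {MAX_NOTE_LENGTH} characters")

    return problems


ok = validate_refund_input(
    {"order_id": 42, "reason_code": "damaged", "note": "box crushed"}
)
rejected = validate_refund_input(
    {"order_id": 42, "reason_code": "damaged", "table": "customers"}
)
```

`ok` is an empty list; `rejected` contains one problem naming the unexpected `table` key. The wrapper around the tool would refuse to run on any non-empty result.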

Resume semantics, declared upfront

LangGraph's checkpointer, Burr's persistence layer, and Temporal's workflow history all have well-defined resume behavior, and that behavior is almost never "exactly once." It is at-least-once. The first question on the audit is: which nodes in this graph are safe to run twice?

For each node, we annotate it in a comment as one of three categories:

pure: deterministic, no external calls. Safe to re-run.
idempotent: external calls, guarded by an op_id. Safe to re-run.
unsafe: must not re-run. Wrap it.

If a node is left tagged unsafe, the agent does not ship. We refactor it into idempotent, or we move the side effect into a queue with its own idempotency story. The audit step is a grep: any unsafe tag in the codebase fails the build.
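The grep itself can be any tool. As a sketch in Python, assuming the annotation is a comment of the form `# resume: unsafe` (the post only says nodes are annotated in a comment, so the exact tag format and the `find_unsafe` name are assumptions):

```python
import re

# Assumed tag format: a comment such as "# resume: idempotent" above each node.
TAG = re.compile(r"#\s*resume:\s*(pure|idempotent|unsafe)\b")


def find_unsafe(sources):
    """Return (filename, line_number) for every node still tagged unsafe.

    sources maps a filename to its text, so the check can be tested
    without touching the filesystem.
    """
    hits = []
    for filename, text in sorted(sources.items()):
        for number, line in enumerate(text.splitlines(), start=1):
            match = TAG.search(line)
            if match and match.group(1) == "unsafe":
                hits.append((filename, number))
    return hits


example = {
    "graph.py": (
        "# resume: pure\n"
        "def load_order(state): ...\n"
        "# resume: unsafe\n"
        "def call_stripe(state): ...\n"
    ),
}
unsafe_nodes = find_unsafe(example)
```

For the example source, `unsafe_nodes` is `[("graph.py", 3)]`. In CI, the same function could be fed the real files and the job made to exit non-zero whenever the list is not empty.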

A kill switch in writable infrastructure

The agent that issued nine refunds was killed by kubectl delete pod. Between the operator noticing the duplicates and Kubernetes terminating the process, two more refunds went out. Process termination is not a kill switch.

A kill switch is a row in a table the agent reads at the top of every step.

def check_kill_switch(state):
    row = db.execute(
        "select paused, paused_reason from agent_controls "
        "where agent_id = %s",
        (state["agent_id"],),
    ).fetchone()
    if row and row["paused"]:
        raise AgentPaused(row["paused_reason"])

The on-call engineer flips the row from psql. The agent stops before its next side effect. The same pattern handles per-customer pauses, per-action pauses, and full shutdown. Cost is one extra query per step. Worth it.

Warning

If your kill switch lives anywhere the agent does not check on every step, it is not a kill switch. It is a hope.
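To show where the check has to sit, here is a self-contained sketch of a step runner that re-reads the control row before every step. It uses sqlite3 as a stand-in for Postgres; `run_steps` and the three step functions are hypothetical, and `operator_pauses` simulates the on-call engineer flipping the row while the run is in progress.

```python
import sqlite3


class AgentPaused(Exception):
    """Raised when an operator has paused the agent."""


def run_steps(conn, agent_id, steps, state):
    """Run steps in order, re-reading the control row before each one."""
    for step in steps:
        row = conn.execute(
            "select paused, paused_reason from agent_controls "
            "where agent_id = ?",
            (agent_id,),
        ).fetchone()
        if row and row[0]:
            raise AgentPaused(row[1])
        step(state)
    return state


conn = sqlite3.connect(":memory:")
conn.execute(
    "create table agent_controls ("
    " agent_id text primary key, paused integer not null, paused_reason text)"
)
conn.execute("insert into agent_controls values ('refunds', 0, null)")


def load_order(state):
    state["log"].append("load_order")


def operator_pauses(state):
    # Simulates the on-call engineer flipping the row mid-run.
    conn.execute(
        "update agent_controls "
        "set paused = 1, paused_reason = 'duplicate refunds' "
        "where agent_id = 'refunds'"
    )
    state["log"].append("decide_amount")


def call_stripe(state):
    state["log"].append("call_stripe")


state = {"log": []}
try:
    run_steps(conn, "refunds", [load_order, operator_pauses, call_stripe], state)
    outcome = "completed"
except AgentPaused as exc:
    outcome = f"paused: {exc}"
```

The run ends with `outcome` set to "paused: duplicate refunds" and a log of two steps. The step that would have called Stripe never starts, because the pause is seen at the top of that step.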

Cost and rate ceilings

An agent's cost is unbounded unless something bounds it, and that something is not the agent. For every shipped agent, we set:

A hard model-token budget per run, enforced in the wrapper around the model client.
A hard tool-call count per run.
A hard outbound-API spend ceiling, where applicable, enforced by checking a running total in Postgres before each call.

We also rate-limit the wrapper at the process level. An agent that retries a refund nine times in two minutes is not "a refund agent." It is a denial-of-wallet attack on its own customer. The retry policy is part of the agent. So is the ceiling on what the retry policy is allowed to cost.
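A minimal sketch of such ceilings, kept in memory for brevity (the spend total described above lives in Postgres; `RunBudget`, `BudgetExceeded`, and the specific limits here are hypothetical):

```python
class BudgetExceeded(Exception):
    """Raised before a call that would push the run past a ceiling."""


class RunBudget:
    """Per-run ceilings, checked by the wrapper and never by the agent."""

    def __init__(self, max_tokens, max_tool_calls, max_spend_cents):
        self.limits = {
            "tokens": max_tokens,
            "tool_calls": max_tool_calls,
            "spend_cents": max_spend_cents,
        }
        self.used = {"tokens": 0, "tool_calls": 0, "spend_cents": 0}

    def charge(self, kind, amount):
        """Reserve amount against a ceiling, or refuse without recording it."""
        if self.used[kind] + amount > self.limits[kind]:
            raise BudgetExceeded(
                f"{kind}: {self.used[kind]} + {amount} > {self.limits[kind]}"
            )
        self.used[kind] += amount


budget = RunBudget(max_tokens=20_000, max_tool_calls=3, max_spend_cents=4_700)
budget.charge("tool_calls", 1)
budget.charge("spend_cents", 4_700)      # the first refund fits the ceiling
try:
    budget.charge("tool_calls", 1)
    budget.charge("spend_cents", 4_700)  # a second refund does not
    second_refund_allowed = True
except BudgetExceeded:
    second_refund_allowed = False
```

The wrapper calls `charge` before each model or tool call and only proceeds if it returns. In this example the second refund is refused before any outbound request is made, whatever the retry policy wanted.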

The dry run that mirrors production

The last item on the checklist is the one that catches what the other items miss. We point the agent at a copy of production data, with a Stripe test key and a throwaway Postgres role, and we run it against the last 200 real cases the human team handled. We diff the actions the agent proposes against the actions the team actually took. A senior engineer reads the diff by hand.

This is slow. It takes half a day per agent. It has caught two cases of an agent refunding the wrong customer (a join condition bug that no unit test would have hit), one case of an agent writing Dutch email to a Vietnamese customer (a locale lookup that defaulted to nl-NL), and one case of an agent confidently approving a chargeback because the prompt contained the word "please" twice. None of these were caught by the test suite. All of them were caught by the diff.

The diff is also where you find out whether your agent is too eager. Frameworks today reward proactivity. Production rewards restraint. Read the diff and find out which one you shipped.
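The mechanical part of that diff is small. A sketch, assuming each action can be described as a plain dict keyed by case id (the action fields and the `diff_actions` name are hypothetical; the reading by hand is the part that cannot be automated):

```python
def diff_actions(proposed, actual):
    """Compare agent proposals with what the human team did, case by case.

    Both arguments map a case id to an action described as a plain dict.
    Returns a list of (case_id, proposed_action, actual_action) for every
    case where the two differ, including cases only one side handled.
    """
    mismatches = []
    for case_id in sorted(set(proposed) | set(actual)):
        agent_action = proposed.get(case_id)
        human_action = actual.get(case_id)
        if agent_action != human_action:
            mismatches.append((case_id, agent_action, human_action))
    return mismatches


actual = {
    "case-1": {"type": "refund", "customer": "c-10", "amount_cents": 4700},
    "case-2": {"type": "reply", "customer": "c-11", "locale": "vi-VN"},
}
proposed = {
    "case-1": {"type": "refund", "customer": "c-10", "amount_cents": 4700},
    "case-2": {"type": "reply", "customer": "c-11", "locale": "nl-NL"},
}
to_review = diff_actions(proposed, actual)
```

`to_review` contains only case-2, where the agent picked nl-NL and the team used vi-VN. That list is what the senior engineer reads.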

The smallest thing you can do today

Open one of your agents. Find the node that calls the outside world. Ask: if this node runs twice, what happens? If the answer is "I don't know," that is the audit. Start there.

When we built the refund agent for that e-commerce client, the bug that taught us this checklist was the checkpoint-resume one. We solved it by moving every side effect behind an op_id row and a Stripe idempotency key, and we wrote the rest of the list above so that the next agent we shipped would not need an incident of its own to learn it. If you want a second pair of eyes on the AI agents you have in production, or in flight, that is the work we do.

Key takeaway

An agent that touches production needs idempotency keys, a scoped Postgres role, a row-level kill switch, and a 200-case dry run before it ships.

FAQ

Why did the LangGraph checkpointer replay the refund step?

Checkpointers persist state on successful step completion. A crash mid-step means the resume starts that step from the top, which is at-least-once. Without an idempotency key, the side effect runs again.

Is a Stripe idempotency key enough on its own?

It covers the duplicate-call window (24 hours), but not the case where your own database does not know a call succeeded. Pair the idempotency key with a row in your own table written before the call.

What database role should an agent use?

A dedicated Postgres role with grants on an allowlist of tables, no CREATE on the schema, and short statement and lock timeouts. Never reuse the application's main role for an agent.

How do you actually stop a running agent in an incident?

A row in a controls table that the agent reads at the top of every step. Process termination is too slow and leaves whatever step was already in flight to finish, often with another side effect.

