LangGraph reliability audit: the checklist before a retrofit
At 03:30 a LangSmith region flips and your invoice-chase agent goes silent. We audit retry drift, prompt pinning, and failover before quoting.

It is 03:30 on a Tuesday. A founder in Eindhoven gets a Slack ping from her on-call engineer: the customer-success agent has stopped replying. The LangSmith status page is yellow on us-east-1. The agent itself is fine, running on her own infra in Frankfurt, but every retry that should have called get_order_status is now silently looping against a tracing endpoint that no longer accepts writes. By 04:10 the queue is 8,000 deep and the morning shift will inherit it.
This is the failure mode we look for first when a sub-€20M Dutch SME asks us to make their agent stack more reliable. Before any retrofit work begins, we run the same five-page audit on whatever LangGraph and LangSmith setup is already in production. What follows is what is on it, why each item is there, and what the scores actually mean when the PDF lands at the end of the week.
What the audit measures
Three things. Nothing else. A reliability retrofit that tries to fix everything tends to fix nothing. Most teams we walk into already have a backlog of "we should clean that up" tickets the size of a phone book. We pick the three failure modes that, across the fourteen agents we currently have in production, account for the overwhelming majority of weekend-killing incidents.
- Tool-call retry drift across the top 30 nodes by call volume.
- Prompt-version pinning on the top 12 sub-agents by traffic share.
- Region-failover survival for the three workflows the business genuinely cannot lose overnight.
The cut-offs are not arbitrary. On a Pareto plot of LangSmith trace counts for almost every SME stack we have audited, the long tail flattens at roughly node thirty and sub-agent twelve. Above the cut, drift compounds quickly because the same nodes are called by half the graph. Below it, the blast radius of any single failure is small enough that fixing it this quarter would be a waste of the budget.
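The ranking behind those cut-offs can be sketched in a few lines. This is a minimal illustration, assuming per-node call counts have already been exported from LangSmith into a plain dict; the function name top_nodes_by_volume and the sample counts are hypothetical, not part of any library.

```python
from itertools import accumulate


def top_nodes_by_volume(call_counts: dict[str, int], limit: int = 30) -> list[tuple[str, int, float]]:
    """Rank nodes by call volume and attach the cumulative traffic share.

    Returns (node_id, calls, cumulative_share) for the `limit` busiest nodes.
    The share is measured against the traffic of all nodes, not just the top ones.
    """
    total = sum(call_counts.values())
    if total == 0:
        return []
    ranked = sorted(call_counts.items(), key=lambda item: item[1], reverse=True)[:limit]
    running = accumulate(calls for _, calls in ranked)
    return [(node, calls, cum / total) for (node, calls), cum in zip(ranked, running)]


# Hypothetical counts, for illustration only.
sample = {"lookup_invoice": 600, "route_intent": 300, "send_reminder": 80, "archive": 20}
for node, calls, share in top_nodes_by_volume(sample, limit=3):
    print(f"{node:15s} {calls:5d} {share:.0%}")
```

Where the cumulative share stops climbing meaningfully is where the tail flattens; on the stacks described here that tends to be around node thirty.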
Tool-call retry drift, scored
A node has drift when the same input, retried, produces a different tool call. This is almost never the model's fault. It is usually one of four causes: a non-pinned tool schema that mutated between calls, a sampling temperature above zero on a node the author believed was deterministic, a retry policy that re-runs the LLM call instead of just the tool, or a structured-output validator that quietly coerces a field on retry number two and then never again.
We pull the last 30 days of traces for the top 30 nodes from LangSmith and bucket every span by (node_id, input_hash, retry_index). Every (node, input) pair that produced more than one distinct tool-call signature across retries earns a drift point. We weight by traffic, normalise to a 100-point scale, and call anything above 12 a hard fail. Anything between 5 and 12 is a soft fail with a one-line note about which nodes carry the weight.
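A minimal sketch of that scoring, under stated assumptions. The span shape (dicts with node_id, input_hash, tool_name, and tool_args keys) is a hypothetical export format, and the weighting shown here, the share of traffic that belongs to drifting (node, input) pairs scaled to 100, is one plausible reading of "weight by traffic", not the exact formula used in the audit. The retry_index field only distinguishes attempts, so the calculation does not need it.

```python
import hashlib
import json
from collections import defaultdict


def tool_call_signature(tool_name: str, tool_args: dict) -> str:
    """Stable fingerprint of a tool call: same name and arguments, same signature."""
    canonical = json.dumps({"name": tool_name, "args": tool_args}, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drift_score(spans: list[dict]) -> float:
    """Traffic-weighted drift on a 0-100 scale (the weighting is an assumption)."""
    signatures = defaultdict(set)  # (node_id, input_hash) -> distinct signatures
    volume = defaultdict(int)      # (node_id, input_hash) -> number of spans
    for span in spans:
        key = (span["node_id"], span["input_hash"])
        signatures[key].add(tool_call_signature(span["tool_name"], span["tool_args"]))
        volume[key] += 1
    total = sum(volume.values())
    if total == 0:
        return 0.0
    drifting = sum(volume[key] for key, sigs in signatures.items() if len(sigs) > 1)
    return 100.0 * drifting / total


def verdict(score: float) -> str:
    """Thresholds from the text: above 12 is a hard fail, 5 to 12 a soft fail."""
    if score > 12:
        return "hard fail"
    if score >= 5:
        return "soft fail"
    return "pass"
```

Serialising the arguments with sorted keys means two calls that differ only in argument order count as the same signature, which keeps cosmetic differences out of the score.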
The fix on a failing node is almost always one of three things, applied in this order: pin the tool schema, drop temperature to zero where the node is doing classification or routing, and narrow the retry scope so it wraps the tool call rather than the LLM step that produced it.
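from langgraph.graph import StateGraph
from langgraph.types import RetryPolicy  # older releases exposed this from langgraph.pregel

# `graph` is the application's StateGraph; `lookup_invoice_tool` and
# `TransientError` are defined elsewhere in the codebase.

# Retry the tool call, not the LLM that produced it.
graph.add_node(
    "lookup_invoice",
    lookup_invoice_tool,
    retry=RetryPolicy(  # newer LangGraph releases name this keyword retry_policy
        max_attempts=4,
        retry_on=(TransientError,),  # never on ValueError
        initial_interval=0.5,
        backoff_factor=2.0,
        jitter=True,
    ),
)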
If your retry policy wraps the model call, you are not retrying the same tool call. You are sampling a new one. That is drift by construction, and it is the single most common pattern we find on day one. The LangGraph team spells the distinction out in their node retries guide, but most teams we audit have copied a tutorial that wraps the whole node and never revisited it.
Prompt-version pinning, the matrix
LangSmith's prompt hub is excellent and dangerous in equal measure. Excellent because a non-engineer can edit a sub-agent's system prompt without redeploying. Dangerous because a non-engineer can edit a sub-agent's system prompt without redeploying.
For each of the top 12 sub-agents we ask one question: is the prompt being pulled by name (latest) or by commit hash (pinned)? Anything pulled by name fails the audit. We do not care how disciplined the team currently is. We care that the rollback path on a 03:30 incident is git revert, not "does anyone remember what the prompt looked like yesterday afternoon?"
from langsmith import Client

client = Client()

# Fails the audit: silently follows whoever last clicked Save.
prompt = client.pull_prompt("invoice-chaser/system")

# Passes the audit: pinned to a commit, behaviour is reproducible.
prompt = client.pull_prompt("invoice-chaser/system:6f3a91b")
If your CI does not fail when a sub-agent references an unpinned prompt, the audit will. We grep the repo for pull_prompt( calls with no colon in the first argument and list every offender by file and line.
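That grep can be sketched as a small script. This is a rough illustration rather than the exact tool we run: it assumes the prompt name is passed as a string literal in the first argument (names built at runtime need a manual look), and the helper names find_unpinned and scan_repo are hypothetical.

```python
import re
from pathlib import Path

# Matches a pull_prompt call whose first argument is a single- or double-quoted literal.
PULL_PROMPT = re.compile(r"""pull_prompt\(\s*(["'])(?P<name>[^"']*)\1""")


def find_unpinned(source: str) -> list[tuple[int, str]]:
    """Return (line_number, prompt_name) for every literal prompt name without a ':commit'."""
    offenders = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for match in PULL_PROMPT.finditer(line):
            if ":" not in match.group("name"):
                offenders.append((line_number, match.group("name")))
    return offenders


def scan_repo(root: str) -> list[tuple[str, int, str]]:
    """Walk every .py file under `root` and list offenders by file and line."""
    results = []
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        results.extend((str(path), line, name) for line, name in find_unpinned(text))
    return results
```

A CI step can call scan_repo on the repository root and fail the build whenever the returned list is non-empty.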
The matrix we hand back has one row per sub-agent and four columns: prompt name, currently pinned commit, latest commit on the hub, and number of unpinned production calls in the last 7 days. Anything non-zero in the last column is what gets fixed first. The cost is almost always a few hours of code changes and a CI rule. The benefit is that a midnight prompt experiment by a well-meaning colleague stops being a P1 incident.
Three workflows that have to survive a region flip
LangSmith runs in two regions: us-east-1 and eu-west-1, with the EU region marketed for traffic that falls under the AVG (the Dutch name for the GDPR). For Dutch SMEs handling personal data, the EU region is not optional. It is the only configuration that keeps trace payloads inside the EEA without writing a transfer-impact assessment that most operations teams are not in a position to produce.
The question is therefore not "are we on the EU region?" That is table stakes. The question is: if eu-west-1 itself goes degraded at 03:30, which workflows can finish their current job, buffer their traces locally, and replay them once the region comes back without losing any of the history that the AVG and your DPO will ask for on Monday morning?
We pick three workflows for survival certification. Three, not all of them. The shortlist criteria, in order:
- The workflow runs unattended outside business hours.
- A silent failure costs the business money or trust the same night.
- The trace is legally required to be retrievable for at least 30 days.
For most of our clients that shortlist is the invoice-chase agent, the inbox-triage agent, and whichever customer-facing chat agent runs 24/7. For a logistics client last year it was the route-replan agent, the SLA-breach watcher, and the night-shift dispatcher hand-off. The shape of the workload changes; the principle does not.
The survival pattern is the same for all three: a local trace buffer on disk, a tight client-side timeout on the LangSmith call, and an explicit replay job. Stripped down to the parts that matter:
import json
import os
import pathlib

from langsmith import Client

BUFFER = pathlib.Path("/var/lib/agent/trace-buffer")
BUFFER.mkdir(parents=True, exist_ok=True)

client = Client(
    api_url=os.environ["LANGSMITH_ENDPOINT_EU"],  # https://eu.api.smith.langchain.com
    api_key=os.environ["LANGSMITH_API_KEY"],
    timeout_ms=750,  # fail fast; tracing must never stall the agent
    auto_batch_tracing=False,  # assumption: without background batching, a failed write raises here
)


def safe_emit(run: dict) -> None:
    """Send one run to LangSmith; on any failure, persist it for the replay job."""
    try:
        client.create_run(**run)
    except Exception:
        # AVG-safe: persist locally, replay later, never drop.
        # default=str keeps UUIDs and datetimes serialisable.
        (BUFFER / f"{run['id']}.json").write_text(json.dumps(run, default=str))
The replay job runs every five minutes and walks the buffer in write order. We benchmark it on the client's own hardware before signing off: on a typical SME workload it can drain a six-hour outage in under twelve minutes once the region is healthy again. The point is not the throughput. The point is that no trace is ever lost, which is what Article 30 of the AVG and a future auditor actually want to hear.
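The replay side can be sketched along the same lines. This is a simplified illustration, not the production job: the function name replay_buffer is hypothetical, the emit callable stands in for whatever sends a run to the tracing backend, and modification time is used as an approximation of write order.

```python
import json
from pathlib import Path
from typing import Callable


def replay_buffer(buffer_dir: Path, emit: Callable[[dict], None]) -> int:
    """Replay buffered runs oldest-first and stop at the first failure.

    A file is deleted only after `emit` returns without raising, so a run is
    never dropped. Returns the number of runs replayed in this pass.
    """
    # Sort by modification time (then name) to approximate write order.
    pending = sorted(buffer_dir.glob("*.json"), key=lambda p: (p.stat().st_mtime_ns, p.name))
    replayed = 0
    for path in pending:
        run = json.loads(path.read_text())
        try:
            emit(run)
        except Exception:
            break  # backend still unhealthy; keep the rest for the next pass
        path.unlink()
        replayed += 1
    return replayed
```

Because the file is removed only after a successful send, a crash between the send and the delete can replay a run twice. That is at-least-once delivery, which errs on the side of never losing a trace.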
If your agent cannot finish its job when LangSmith is down, LangSmith is not your tracing backend. It is your single point of failure, and you are paying for the privilege.
What is deliberately not on the checklist
We do not audit model choice. We do not audit cost per run. We do not score the system prompt for quality. Those are interesting questions and we will happily argue about them over coffee, but they are not reliability. They are product decisions, and they belong in a different conversation with different people in the room.
We also do not benchmark eval coverage at this stage. Evals matter, but a team whose retries drift and whose prompts are unpinned will get noisy eval results that everyone in the channel quietly learns to ignore. Fix the substrate first. Then the evals start telling you something true, and the team starts trusting them again.
When this audit ran against a B2B SaaS workload in Utrecht last quarter, the retry-drift score came back at 38 out of 100 on a system the team believed was rock-solid: one high-volume node had wrapped a fully deterministic SQL tool in a generic LLM-retry decorator inherited from a tutorial, and a single commit closed the gap. That is the kind of unglamorous fix that moves the score most in our AI agents work.
If you want to start before you talk to anyone: pull the last 30 days of LangSmith traces for your three highest-volume nodes, group them by input hash, and count how many distinct tool-call signatures each input produced. If the answer is more than one for more than 5% of inputs, you have drift. That is ticket number one.
Key takeaway
Audit retry drift, prompt pinning, and region failover before paying for an agent reliability retrofit. The rest is product, not reliability.
FAQ
How long does the audit take from kick-off to delivered PDF?
Four working days for a stack of up to ~80 nodes and a dozen sub-agents. Larger graphs scale roughly linearly, but we will flag that on the intake call before we start the clock.
Do you need direct access to our production LangSmith workspace?
Read-only access to traces for the last 30 days, plus read access to the prompt hub. We do not need keys for the underlying models or your application database to run the audit.
What if we are still on the LangSmith us-east-1 region?
Migrating to eu-west-1 becomes item one of the retrofit. We can run the audit against the US region, but no Dutch SME handling personal data should stay there once the audit is done.
Can our own engineers run this checklist without you?
Yes. The three measurements above are the whole method. Most teams who try it report the prompt-pinning pass takes a morning and the retry-drift scoring takes a day of careful trace analysis.