← Blog

AI agents

Tokenizer drift: anatomy of a 7-hour AI agent outage

A vendor pushed a new tokenizer on a Tuesday. By 11pm our contract-review agent was confidently citing clauses that did not exist. Seven hours later, here is the postmortem.

Jacob Molkenboer· Founder · A Brand New Company· 8 Jun 2026· 10 min
Brass relay on ivory paper diagram, green sticky note beneath, broken red wax seal, side window light.

The 11pm Slack ping

A senior partner at a Den Haag commercial firm was reviewing the agent's overnight output when she flagged it. The agent had summarised a supplier termination clause as "Article 12.4(b) requires 60 days notice." She checked the contract. Article 12.4(b) was about indemnification. The notice clause was at 14.1. The agent had not misread it. It had invented it.

The Slack ping came in at 23:11 Amsterdam time on a Tuesday. By 06:30 the next morning we had pulled the agent offline, traced the bug, re-embedded 4,127 contracts and shipped a fix. Seven hours. The cause was not the model. It was a tokenizer.

What the agent was supposed to do

The client is a legal SaaS that helps mid-cap Dutch companies review supplier contracts before signing. The agent reads a PDF, finds the clauses that matter (termination, IP, liability cap, governing law, data processing) and produces a short risk summary with article-number citations. A clerk skims the summary; a partner signs off; a lot of billable hours get redirected to actual lawyering.

The firm runs about 320 supplier contracts a month through the pipeline. At baseline the agent flags risk language a junior reviewer would miss and surfaces a citation count a partner can spot-check in under three minutes. The economic case is simple: every contract the agent reviews correctly is roughly two hours of associate time the firm gets back. That economic case rests on the citation being right.

The pipeline:

  1. OCR plus layout parse with a Dutch-tuned parser.
  2. Chunk into roughly 500-token blocks with 60-token overlap.
  3. Embed with the vendor's text-embedding model.
  4. Store in a Postgres + pgvector index.
  5. For each clause category, retrieve top-k chunks and pass them to Claude with a strict citation prompt.

The citation prompt was clear: "Cite the article number exactly as it appears in the source text. If the article number is partial or unclear, return UNCLEAR." That line had worked in evals for six months. Then a Tuesday morning vendor release moved the floor under it.

Where the chunking broke

Dutch contracts number clauses densely: Artikel 7.2.3 (a), sub (i), sub (B). The tokenizer we used for chunking treated 7.2.3 as four tokens: 7, ., 2, .3. That had been stable since launch.

On the Tuesday in question, the vendor pushed a tokenizer revision. It now treated long number-dot sequences in Latin scripts as a single subword. 7.2.3 became one token. Two consequences:

  • Every chunk shifted by a few tokens. The boundary that used to land between paragraphs now landed mid-clause.
  • Article references at the start of a paragraph were sometimes orphaned at the tail of the previous chunk: ... met inachtneming van Artikel on chunk 47, then 7.2.3 verklaart de Leverancier ... on chunk 48.

Retrieval returned chunk 48. The model saw a sentence that started with 7.2.3 verklaart and no leading article context. It looked at the surrounding text, recognised a discussion of termination, and wrote down what a plausible Dutch supplier contract usually says: Article 12.4(b), 60 days notice. The prompt's UNCLEAR escape hatch did not fire because the article number was not partial in the input it saw. It was wrong in a confident-looking way.

Warning

If your RAG pipeline reads numbered references (legal articles, RFC sections, ICD codes, SKU strings), a 2% shift in token boundaries can move every reference in your corpus across a chunk wall. The model will not refuse. It will guess.

Three monitors that missed it

We had three layers of monitoring and the bug walked through all of them.

Layer one was a fixed eval set of 240 contracts with hand-annotated correct citations. It ran on every model upgrade. It did not run on tokenizer-only changes because we had never treated the tokenizer as a separate dependency. It lived inside the embedding library we pinned by major version.

Layer two was production drift detection. It watched output length distribution, refusal rate, clause-category coverage and a citation-shape check that asserted every cited article matched the local pattern (digit, dot, digit, optional letter, optional sub-letter). The output looked normal. Citation shapes passed the regex perfectly: 12.4(b) is exactly what a real Dutch article number looks like. Refusal rate did not move. The agent was confidently wrong in a way that no aggregate metric could see, because the metric was checking syntax and the bug was about reference.

Layer three was the human review queue. The firm reviewed 5% of outputs by sample. The Tuesday queue ran late. By the time the partner noticed at 23:11, eleven contracts had gone out with bad citations. Three of them had already been forwarded to clients.

The seven hours

Rough timeline, reconstructed from the incident channel:

  • 23:11. Partner pings the on-call channel.
  • 23:25. We pull the agent offline and serve a maintenance banner. The last 48 hours of summaries are flagged for re-review.
  • 00:40. We reproduce the bad citation in staging with the same contract. Prompt, retrieved chunks and model output match production.
  • 01:50. We notice the retrieved chunk does not contain the article number. We check the chunk store. Half of the chunks for that contract look subtly off in their boundaries. The chunks were rebuilt by an automated nightly job that morning.
  • 02:30. We diff the chunking output against the previous week's snapshot. Boundaries have shifted by 1 to 4 tokens almost everywhere.
  • 03:15. We find a tokenizer changelog note in the vendor's release feed from Tuesday morning. One line. No deprecation window.
  • 04:10. We pin the prior tokenizer revision and run the chunker against a 50-contract canary. Boundaries match the pre-Tuesday state.
  • 05:30. We re-embed the corpus, all 4,127 contracts, using the pinned tokenizer. We run the eval set. 240 of 240 pass.
  • 06:30. Agent back online, maintenance banner removed. We start drafting the disclosure email for the firm's clients.

The fix in code

The shape of the fix is simple. Pin everything that touches text shape. Treat the tokenizer as a versioned dependency, not an implementation detail.

from tokenizers import Tokenizer

# Before: implicit version, loaded by name.
# tokenizer = Tokenizer.from_pretrained("vendor/text-embed-v3")

# After: explicit revision, pinned in source, mirrored to an internal store.
TOKENIZER_REPO = "vendor/text-embed-v3"
TOKENIZER_REVISION = "a14f0c2b9d1e"  # full commit hash, not a tag

tokenizer = Tokenizer.from_pretrained(
    TOKENIZER_REPO,
    revision=TOKENIZER_REVISION,
)

CHUNK_TARGET = 500
CHUNK_OVERLAP = 120  # doubled from 60; index grows ~12%, references survive

def chunk(text: str) -> list[str]:
    ids = tokenizer.encode(text).ids
    out, i = [], 0
    while i < len(ids):
        out.append(tokenizer.decode(ids[i:i + CHUNK_TARGET]))
        i += CHUNK_TARGET - CHUNK_OVERLAP
    return out

Two details worth pulling out. First, the revision is stored alongside every chunk row in pgvector. If we ever re-embed with a new tokenizer, the old chunks are visibly stale, not silently mixed. Second, the overlap doubled from 60 to 120 tokens. That is not free, the index grows by about 12%, but any reasonable article reference now appears in at least two chunks.

Anthropic publishes a token counting endpoint that we now call as a sanity check in CI: if the token count of a fixed reference contract drifts by more than half a percent between runs, the build fails. Their token counting docs were already in our bookmarks. We had just not wired them into the failure path.

Three permanent changes

We made three changes that we now apply to every agent we ship.

Pin every text-shape dependency by full revision. Tokenizers, embedding models, OCR engines, anything that turns text into other text. Major-version pinning is not enough; semantic versioning does not cover "the same string now produces a different integer sequence". If the vendor will not give you a revision hash, mirror the artefact yourself and pin against your mirror.

Run a chunk-integrity canary. Every hour, a fixed reference contract is re-chunked and its boundaries are compared against a locked baseline. If a single token boundary moves, the canary pages on-call. The reference contract has 47 article references; the canary asserts that all 47 still land inside a chunk, not on a chunk wall. The locked baseline lives in the same repo as the chunker, so a deliberate boundary change is one PR and one review, not a silent vendor push.

Disclose drift to the customer early. We told the firm's compliance lead within four hours of confirming the cause, and the firm told its clients within 24. The instinct is to wait until you have a clean postmortem. The right move is to flag the shape of the problem as soon as you understand it, and follow up with the detail. NCSC's incident response guidance phrases it differently but lands in the same place: do not wait for a tidy story to start the disclosure conversation.

Telling the firm

The disclosure conversation is the part of an incident most engineering teams underweight. Our default would have been to wait for a clean root cause, write a tidy two-page postmortem, then send it. The firm's compliance lead wanted something different: a single paragraph by 04:00 saying what we knew, what we did not know, and which contracts were affected. The detailed write-up could come later.

She was right. By 09:00 the firm's partners were on calls with the three clients whose contracts had gone out with bad citations. The conversations were short. The clients wanted to know two things: was it fixed, and would the firm tell them next time before they had to ask. The honest answer to both was yes. None of the three contracts had been signed off on the agent's wrong citation alone; the partners had read the underlying PDFs.

That last point is worth sitting with. The agent is a billable-hours redirector, not a replacement for a partner's eye. The day a firm starts trusting an LLM citation without re-reading the clause is the day the next vendor patch becomes a malpractice claim. We rewrote the in-product copy that week to make the re-read step harder to skip: every summary now ships with the source paragraph inline, not a link to it.

What this looks like for your stack

If you run a RAG-flavoured agent against any vendor model, three checks today, fifteen minutes each:

  1. grep your codebase for tokenizer and embedding model names. For every one, find the version string. If it ends in -latest, a bare major version, or no suffix at all, you are exposed to silent drift.
  2. Open your vendor's changelog feed and find the most recent tokenizer or model patch you did not deploy through. If you cannot point at one, you probably missed it.
  3. Pick one reference document from your corpus. Re-chunk it today, save the boundary offsets, re-chunk it again next Monday. If even one offset has moved, your pipeline is rebuilding itself under you.

When we built the contract-review agent for the Den Haag firm, the thing we ran into was that a vendor's tokenizer revision moved every chunk boundary by a few tokens without changing any output metric we tracked. We solved it by pinning the revision in source, doubling chunk overlap and adding a chunk-integrity canary. That kind of plumbing is most of what we do when we build AI agents for clients running real workflows.

Smallest thing to do today: run grep -RIn "from_pretrained\|encoding_for_model\|tiktoken.get_encoding" . on your agent repo. For every match, write the full revision hash next to it. If you cannot find one, that is the bug.

Key takeaway

If your RAG agent cites numbered references, an unpinned tokenizer is one vendor patch away from confident-looking hallucinations.

FAQ

What is tokenizer drift in a RAG pipeline?

When a vendor updates a tokenizer, the same text splits into different tokens. That moves chunk boundaries, can orphan key references, and changes what the retriever returns to the model.

How do I pin a tokenizer version?

Reference it by full revision hash in your model loader, not by name or major version. Mirror the artefact to an internal store so vendor-side deletes or renames do not break your build.

Does this affect OpenAI or Anthropic embeddings too?

Yes. Any vendor can patch a tokenizer between releases. Anthropic's token counting endpoint is the closest thing to a stable contract you can wire into CI as a drift check.

How often should a chunk-integrity canary run?

Hourly is enough for most RAG agents. The point is to catch drift before the next scheduled re-embed job rebuilds your index against a moved tokenizer.

ai agentsragcase studyarchitectureoperationstooling

Building something?

Start a project