LLM provenance audits: catching a relabelled GEITje merge

A Breda municipal-tech vendor handed us a Dutch LLM they said was built in-house. Two hours into the tender review, the tokenizer hash matched a public checkpoint.

Jacob Molkenboer · Founder · A Brand New Company · 16 Jun 2026 · 8 min

Thursday morning, 9:14, a third-floor meeting room above the parking garage on Stationsweg in Breda. Six people at the table: three procurement leads from a Dutch municipal IT cooperative, two engineers from a 23-person GovTech vendor, and us, sitting in as the technical reviewer on a tender for a Dutch-speaking citizen-services agent. The vendor's deck said "proprietary Dutch language model, trained in-house on Dutch infrastructure." The contract was worth a year of the vendor's revenue. The decision was supposed to land before lunch.

The model card the vendor sent the night before listed a 7B parameter LLaMA-style decoder, "trained from scratch on a curated Dutch corpus." The artifacts arrived as a 14 GB safetensors bundle, a tokenizer.json, and a config.json. We had ninety minutes before the procurement decision. The first file we opened was the tokenizer.

The signal that ended the meeting

We computed a SHA-256 over the sorted vocabulary and merge table and compared it against a reference set of public Dutch and multilingual checkpoints. The hash was identical to that of the tokenizer shipped with mistralai/Mistral-7B-Instruct-v0.2. It also matched Rijgersberg/GEITje-7B, the best-known Dutch continued pretraining of Mistral-7B, which inherits the Mistral tokenizer unchanged.

A tokenizer is the one artifact you cannot accidentally reproduce. If you genuinely train a model from scratch on Dutch text, you choose your own vocabulary size, your own special tokens, your own merge table. The probability of two independent BPE trainings on different corpora producing the same 32000 entries in the same order is effectively zero. The hash decided the meeting.

We did not stop there. The tokenizer match told us the vendor had inherited a parent. The interesting question was which parent, and whether more than one had contributed. So we moved on to the weights.

The embed_tokens.weight tensor in the vendor's bundle had a row-wise mean cosine similarity of 0.94 with the GEITje-7B embedding matrix and 0.91 with Mistral-7B-Instruct. The lm_head sat closer to Mistral-Instruct. Layer by layer, the lower blocks tracked GEITje, the upper blocks tracked Mistral-Instruct, with a smooth sigmoid handover in the middle. That is not a finetune. That is a SLERP merge, probably done with mergekit, then quantized, then relabelled.
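To make the handover concrete, here is a minimal sketch of spherical interpolation between two parents, assuming the interpolation factor t is scheduled across depth. The toy vectors, the sigmoid_schedule helper, and its steepness are hypothetical illustrations; this is not mergekit's implementation and not the vendor's recipe.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def slerp(t, a, b, eps=1e-8):
    """Spherical interpolation between two flattened weight vectors."""
    omega = math.acos(max(-1.0, min(1.0, cosine(a, b))))
    if omega < eps:
        # Nearly parallel parents: fall back to plain linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    wa = math.sin((1 - t) * omega) / math.sin(omega)
    wb = math.sin(t * omega) / math.sin(omega)
    return [wa * x + wb * y for x, y in zip(a, b)]


def sigmoid_schedule(layer, n_layers, steepness=0.5):
    """Hypothetical schedule: t moves from ~0 (parent A) to ~1 (parent B) with depth."""
    return 1.0 / (1.0 + math.exp(-steepness * (layer - (n_layers - 1) / 2)))


# Toy stand-ins for one tensor from each parent (not real weights).
parent_a = [1.0, 0.0, 0.2]
parent_b = [0.0, 1.0, 0.2]
n_layers = 32

profile = []
for layer in range(n_layers):
    merged = slerp(sigmoid_schedule(layer, n_layers), parent_a, parent_b)
    profile.append((cosine(merged, parent_a), cosine(merged, parent_b)))

for layer in (0, 15, 16, 31):
    print(layer, [round(v, 3) for v in profile[layer]])
```

Under this assumed schedule the similarity to the first parent falls with depth while the similarity to the second rises, which is the shape we saw when plotting the vendor's blocks against GEITje and Mistral-Instruct.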

Warning

If a vendor's "in-house" 7B Dutch model ships with a tokenizer of vocab_size 32000 and the Mistral SentencePiece layout, you are looking at a Mistral derivative. Ask which one, and which merge tool, before signing anything.

Awkward, not malicious

GEITje-7B was, for most of 2024, the most usable Dutch open base model in the 7B class, released under Apache 2.0. Merging it with Mistral-7B-Instruct, also Apache 2.0, to get a Dutch model that follows instructions is a defensible engineering shortcut. It saves six figures of training compute and two months of calendar time. The problem is not the technique. The problem is the label.

A municipal procurement officer reading "proprietary Dutch LLM, trained in-house" makes three assumptions: the vendor controls the training data, the vendor can patch the model when something breaks, and the vendor owns the IP outright. None of them holds cleanly for a relabelled merge. The training data lives in two upstream corpora the vendor has never seen. Patching a merge is brittle, because re-merging shifts every weight in the network. The IP position depends on Apache 2.0 attribution, which the vendor's model card had quietly omitted.

This matters because the same vendor was bidding on a contract that included answering GDPR data-subject access requests via the model. If the model hallucinates a citizen's housing-benefit history, the cooperative wants a clear chain of accountability for the training corpus. A relabelled merge severs that chain twice over: once at GEITje's pretraining mix and once at Mistral-Instruct's instruction-tuning set.

The six-step provenance diff we now run

Since the Breda morning we have run this on every model a vendor presents in a Dutch GovTech tender. It takes about forty minutes per candidate, requires no special tooling beyond Python and the Hugging Face stack, and assumes you have local access to a small candidate set of public checkpoints. We keep ours in object storage, version-pinned to the releases vendors merge most often: Mistral-7B and 7B-Instruct (v0.1, v0.2, v0.3), GEITje-7B, GEITje-7B-Ultra, Llama-2-7B, Llama-3-8B, and Qwen2-7B.

1. Tokenizer fingerprint

Compute a hash of the sorted vocabulary and the merge table. If it matches any public model, the vendor's "from scratch" claim is finished and you proceed to identify the parent. If it matches nothing public, do not relax yet: the model might still be a quantized derivative with a regenerated tokenizer.

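import hashlib
import json

from transformers import AutoTokenizer


def tokenizer_fingerprint(model_id_or_path):
    """SHA-256 over the id-sorted vocabulary and the BPE merge table."""
    tok = AutoTokenizer.from_pretrained(model_id_or_path)
    # Sort by token id so the hash does not depend on dict ordering.
    vocab = sorted(tok.get_vocab().items(), key=lambda kv: kv[1])
    # Fast tokenizers serialize to the tokenizer.json layout, where BPE
    # merges live under model.merges; other tokenizers contribute no merges.
    backend = getattr(tok, "backend_tokenizer", None)
    state = json.loads(backend.to_str()) if backend is not None else {}
    merges = state.get("model", {}).get("merges", [])
    payload = json.dumps({"vocab": vocab, "merges": merges}, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


for ref in [
    "mistralai/Mistral-7B-Instruct-v0.2",
    "Rijgersberg/GEITje-7B",
    "meta-llama/Llama-2-7b-hf",
    "./vendor-bundle",
]:
    print(ref, tokenizer_fingerprint(ref))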

2. Embedding cosine similarity

Two models that share a tokenizer can still have unrelated embedding matrices if one was actually trained from scratch on that vocabulary. A row-wise mean cosine similarity above 0.85 against any public parent indicates derivation, not coincidence. We compute it against every candidate in our reference set.

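import torch
from safetensors import safe_open


def embed_matrix(safetensors_path, name="model.embed_tokens.weight"):
    # Read only the embedding tensor instead of loading the whole shard.
    with safe_open(safetensors_path, framework="pt", device="cpu") as f:
        return f.get_tensor(name).float()  # upcast: quantile() needs float32/64


# Shard names are examples; model.safetensors.index.json lists the right shard.
vendor = embed_matrix("vendor-bundle/model-00001-of-00003.safetensors")
geitje = embed_matrix("geitje-7b/model-00001-of-00002.safetensors")
assert vendor.shape == geitje.shape, "different vocab or hidden size: not row-comparable"

# One cosine per vocabulary row, then summarise the distribution.
sim = torch.nn.functional.cosine_similarity(vendor, geitje, dim=1)
print("mean cosine:", sim.mean().item(),
      "p05:", sim.quantile(0.05).item(),
      "p95:", sim.quantile(0.95).item())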

3. Layer-by-layer drift profile

Compute the per-layer cosine similarity for the q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, and down_proj of every transformer block, against each candidate parent. Plot it. A clean continued pretraining produces a flat similarity line near 0.95. A targeted finetune shows a flat line with a small dip in the upper layers. A merge produces a step function, a sigmoid, or two interleaved bands. The shape hints at the merge family: a clean step suggests linear interpolation, a sigmoid suggests SLERP, and two interleaved bands suggest TIES or DARE.
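A minimal sketch of the profile computation, assuming Mistral-style tensor names (model.layers.N.self_attn.q_proj.weight and so on) and state dicts already flattened to plain lists of floats. The drift_profile helper and the toy data are hypothetical; on real checkpoints you would load the tensors from safetensors and use a vectorised cosine instead of the pure-Python one here.

```python
import math
import re
from collections import defaultdict

PROJECTIONS = ("q_proj", "k_proj", "v_proj", "o_proj",
               "gate_proj", "up_proj", "down_proj")
# Assumed naming scheme: model.layers.<idx>.<module>.<projection>.weight
LAYER_RE = re.compile(r"model\.layers\.(\d+)\.\w+\.(\w+)\.weight$")


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def drift_profile(candidate, parent):
    """Mean cosine similarity per transformer block.

    Both arguments map tensor names to flattened weight values.
    """
    per_layer = defaultdict(list)
    for name, values in candidate.items():
        match = LAYER_RE.match(name)
        if not match or match.group(2) not in PROJECTIONS or name not in parent:
            continue  # skip embeddings, norms, and tensors the parent lacks
        per_layer[int(match.group(1))].append(cosine(values, parent[name]))
    return {layer: sum(sims) / len(sims) for layer, sims in sorted(per_layer.items())}


# Toy data: block 0 points the same way as the parent, block 1 is orthogonal.
toy_candidate = {
    "model.layers.0.self_attn.q_proj.weight": [1.0, 0.0],
    "model.layers.0.mlp.up_proj.weight": [0.0, 2.0],
    "model.layers.1.self_attn.q_proj.weight": [1.0, 0.0],
    "model.embed_tokens.weight": [1.0, 1.0],
}
toy_parent = {
    "model.layers.0.self_attn.q_proj.weight": [1.0, 0.0],
    "model.layers.0.mlp.up_proj.weight": [0.0, 4.0],
    "model.layers.1.self_attn.q_proj.weight": [0.0, 1.0],
    "model.embed_tokens.weight": [1.0, 1.0],
}
toy_profile = drift_profile(toy_candidate, toy_parent)
print(toy_profile)
```

Run it once per candidate parent and plot the resulting dictionaries on the same axes; the shapes described above only become readable when the parents are overlaid.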

4. Identical-prompt drift test

Run the vendor's model and each candidate parent against the same fifty Dutch prompts at temperature 0, max_new_tokens 256, with the same seed. Compute the Levenshtein distance between outputs, normalised by output length. If the vendor's outputs sit within 8% normalised edit distance of one parent for some prompts and of the other parent for other prompts, you have behavioural confirmation that lines up with the weight-side profile from step 3.

This step is also where you catch the rarer case: a vendor who fine-tuned on a public model's own outputs to look further away in weight space than they actually are. The Levenshtein test does not care about weights, only behaviour.
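The comparison itself can be sketched in a few lines, assuming the generations have already been collected as strings. Normalising by the longer output's length is our assumption about how to read the 8% figure, and nearest_parent is a hypothetical helper name.

```python
def levenshtein(a, b):
    """Classic two-row dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def normalised_distance(a, b):
    """Edit distance divided by the longer string's length (assumed convention)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def nearest_parent(vendor_output, parent_outputs, threshold=0.08):
    """Return the closest parent's name if it is within threshold, else None."""
    name, text = min(parent_outputs.items(),
                     key=lambda kv: normalised_distance(vendor_output, kv[1]))
    return name if normalised_distance(vendor_output, text) <= threshold else None


# Toy outputs for a single prompt; real runs would loop over all fifty.
parents = {
    "geitje": "U kunt uw paspoort ophalen bij de balie!",
    "mistral-instruct": "Ik weet het niet.",
}
print(nearest_parent("U kunt uw paspoort ophalen bij de balie.", parents))
```

Tallying the returned names across the prompt set shows whether the vendor's behaviour splits between parents the way the weights do.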

5. Architecture and config diff

The config.json gets diffed against the candidates field by field. A merge inherits hidden_size, num_attention_heads, and num_key_value_heads from its parents, and vendors often forget to change rope_theta or sliding_window. Mistral-7B-Instruct-v0.2 has a distinctive rope_theta of 1000000.0 and no sliding_window. v0.1 has sliding_window 4096 and rope_theta 10000.0. If those exact values appear in a "from scratch" model, write the version number into your report.
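A minimal sketch of the field-by-field diff. The helper names are hypothetical, and the toy reference dictionaries carry only the two values quoted above; a real run would load each candidate's full config.json with load_config.

```python
import json
from pathlib import Path

FIELDS = ("hidden_size", "num_attention_heads", "num_key_value_heads",
          "rope_theta", "sliding_window")


def load_config(path):
    return json.loads(Path(path).read_text(encoding="utf-8"))


def config_diff(vendor, reference, fields=FIELDS):
    """Return {field: (vendor_value, reference_value)} for fields that differ."""
    return {f: (vendor.get(f), reference.get(f))
            for f in fields if vendor.get(f) != reference.get(f)}


def closest_reference(vendor, references):
    """Name of the reference config with the fewest differing fields."""
    return min(references, key=lambda name: len(config_diff(vendor, references[name])))


# Toy references holding only the values stated in the text.
references = {
    "Mistral-7B-v0.1": {"rope_theta": 10000.0, "sliding_window": 4096},
    "Mistral-7B-Instruct-v0.2": {"rope_theta": 1000000.0, "sliding_window": None},
}
vendor_config = {"rope_theta": 1000000.0, "sliding_window": None}

print(closest_reference(vendor_config, references))
print(config_diff(vendor_config, references["Mistral-7B-v0.1"]))
```

An empty diff against one specific release is the result worth writing into the report, since it names the version and not only the family.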

6. License and attribution chain

Read the model card, the LICENSE file, the NOTICE file, and the README. If the model is a derivative of Apache 2.0 weights, the vendor must ship a copy of the license, retain the upstream attribution notices, and reproduce any NOTICE file the upstream includes. A missing NOTICE is not a forensic signal on its own. Combined with steps 1 through 5 it tells you the vendor either does not understand their obligations or is hoping you will not check. The procurement consequence is identical.
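The mechanical part of this step can be sketched as a file-presence and credit check. It assumes the bundle keeps its documents as top-level LICENSE, NOTICE, and README files, and that a plain substring match on the upstream names is a good enough first pass; both helper names are hypothetical, and the check does not replace reading the documents.

```python
from pathlib import Path

EXPECTED = ("LICENSE", "NOTICE", "README")


def read_bundle(bundle_dir):
    """Map upper-cased file stems to text for LICENSE*, NOTICE*, README* files."""
    texts = {}
    for path in Path(bundle_dir).iterdir():
        stem = path.name.split(".")[0].upper()
        if path.is_file() and stem in EXPECTED:
            texts[stem] = path.read_text(encoding="utf-8", errors="ignore")
    return texts


def attribution_report(texts, upstream_names):
    """List expected files that are absent and upstreams never mentioned."""
    missing = [stem for stem in EXPECTED if stem not in texts]
    combined = "\n".join(texts.values()).lower()
    uncredited = [name for name in upstream_names if name.lower() not in combined]
    return {"missing_files": missing, "uncredited_upstreams": uncredited}


# Toy bundle contents; a real run would call read_bundle("./vendor-bundle").
toy_texts = {
    "README": "Proprietary Dutch language model, trained in-house.",
    "LICENSE": "Apache License, Version 2.0",
}
report = attribution_report(toy_texts, ["GEITje-7B", "Mistral-7B-Instruct"])
print(report)
```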

What we recommended to the cooperative

The vendor was not disqualified. We wrote a one-page memo, attached the six diff outputs as appendices, and recommended three changes to the bid. First, restate the model in the contract as "Mistral-7B-Instruct, continued-pretrained on Dutch via GEITje, merged via mergekit," with the upstream licenses attached and the merge recipe checked into a repository the cooperative could read. Second, commit to a documented re-merge process so the cooperative could rebuild the artifact from upstream if the vendor disappeared. Third, name a fallback model from the public Dutch 7B set with a tested swap procedure and a clear performance delta.

The vendor accepted all three. The cooperative awarded the contract two weeks later, with a clause requiring a tokenizer-hash-and-config attestation at every model update. The procurement officer told us afterwards that the relabelled-merge problem had come up twice before in the previous year, both times caught accidentally, neither time documented. The six-step diff is now in the cooperative's standard evaluation pack.

The local-model conversation Dutch procurement is missing

There is a thread near the top of the Hacker News front page this week asking whether anyone has replaced Claude or GPT with a local model for daily coding. The answers are mixed, and the discussion assumes a sophisticated user who can swap models, read benchmarks, and notice when output quality shifts under their fingers. Dutch municipal buyers have none of that infrastructure. They depend on the label on the bundle to be accurate, because nobody on their side will run the diff.

The cheapest fix is the one this incident points to. Any tender that names a specific model owes the buyer a provenance file, signed, with hashes for the tokenizer, the config, and the first and last embedding rows. A six-step diff against public checkpoints takes a competent engineer half a day. The cost of not running it is a procurement decision made on a label that does not survive an hour of inspection, on a model that will be in production answering questions from citizens for the next three years.
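As an illustration of what such a provenance file could contain, here is a minimal sketch. The field names are hypothetical, the embedding rows are hashed as little-endian float32 by assumption, and the HMAC with a shared key stands in for whatever signature scheme the tender actually specifies; a real deployment would more likely use an asymmetric signature.

```python
import hashlib
import hmac
import json
import struct


def sha256_hex(data):
    return hashlib.sha256(data).hexdigest()


def row_hash(row):
    """Hash one embedding row serialised as little-endian float32 values."""
    return sha256_hex(struct.pack(f"<{len(row)}f", *row))


def build_manifest(model_name, tokenizer_bytes, config_bytes, first_row, last_row):
    return {
        "model": model_name,
        "tokenizer_sha256": sha256_hex(tokenizer_bytes),
        "config_sha256": sha256_hex(config_bytes),
        "embed_first_row_sha256": row_hash(first_row),
        "embed_last_row_sha256": row_hash(last_row),
    }


def sign(manifest, key):
    # Canonical JSON so the same manifest always produces the same signature.
    body = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def verify(manifest, signature, key):
    return hmac.compare_digest(sign(manifest, key), signature)


# Toy inputs; real ones would be the raw tokenizer.json and config.json bytes
# and the first and last rows of model.embed_tokens.weight.
manifest = build_manifest("vendor-model", b"tokenizer-bytes", b"config-bytes",
                          [0.1, 0.2, 0.3], [0.4, 0.5, 0.6])
signature = sign(manifest, b"shared-secret")
print(json.dumps(manifest, indent=2))
print("valid:", verify(manifest, signature, b"shared-secret"))
```

The buyer recomputes the manifest from the delivered bundle at each model update and checks it against the signed copy, which is the attestation the contract clause asks for.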

The smallest thing to do today

If you are evaluating a vendor's "in-house" LLM this quarter, run step one before your next meeting. The tokenizer hash takes ninety seconds and tells you whether the rest of the audit is worth doing. When we built the provenance-review pipeline for the Breda cooperative, the bottleneck was not the diff itself but writing an attestation clause that survives a vendor swap. ABN helps Dutch GovTech buyers run this kind of provenance audit on shortlisted vendors before contracts are signed.

Key takeaway

A tokenizer hash takes ninety seconds and ends most "we trained it in-house" vendor claims; if it matches a public checkpoint, the rest of the audit is paperwork.

FAQ

How do I know if a vendor's LLM is actually trained from scratch?

Hash the sorted tokenizer vocabulary and merge table. A genuinely from-scratch model has a unique BPE table, so an identical hash to Mistral, Llama, or GEITje means the vendor inherited that model.

Is merging GEITje with Mistral-Instruct technically legal?

Yes, when both upstreams are Apache 2.0 and the vendor reproduces NOTICE files and credits upstream authors. The legal risk is in omitting the attribution, not in the merge itself.

How long does the six-step provenance audit take?

About forty minutes per candidate once public reference checkpoints are staged locally. Step one alone takes ninety seconds and ends most weak claims.

What if the tokenizer hash is unique to the vendor?

Continue with embedding cosine similarity and the layer drift profile. A vendor can regenerate a tokenizer to hide a merge, but the weights still carry the parent fingerprint.

Should we disqualify vendors who present merged models as in-house?

Not necessarily. Require a restated model description, the merge recipe in a readable repository, upstream licenses attached, and a fallback model with a tested swap procedure.
