RAG fact-checking: how a Leiden publisher rewired its desk
A Leiden specialty publisher had six editors fact-checking against a 41,000-article archive. We replaced the retrieval step with a RAG agent. Here is what we learned.

The Wednesday afternoon problem
A senior editor at a Leiden specialty publisher slides a 6,000-word manuscript across her desk. It is an essay on early modern trade routes, written by an academic her house has published for nineteen years. The author is reliable. The footnotes are not. Three previous manuscripts contradicted dates already established in the house's own archive, and the corrections shipped late, embarrassing both sides.
For most of 2024, slips like these were caught by a six-person fact-checking desk that read every accepted manuscript against the back catalogue. They were good. They were also slow, expensive, and structurally impossible to scale past the eighty or so titles the house ships each year. By autumn, the publisher asked us a question that sounds simple. Can a machine do the first pass?
This is a story about saying yes carefully. Some of it worked. Some of it surprised us. The desk still exists, just not the way it used to.
Fact-checking as a retrieval problem
Fact-checking inside a specialty publisher is not the same as fact-checking in journalism. A journalist verifies claims against the world. An editor at a specialty house verifies claims against the house. The world might say the Treaty of Utrecht was signed in 1713. The house's own 2019 monograph on Dutch trade said the same thing, but pinned a specific clause to a specific April letter. If a new manuscript fudges that, the contradiction shows up in your own footnotes, in print, on shelves you control.
That makes the problem narrow in a way LLMs handle well. You are not asking the model to know things. You are asking it to find what the house already published, compare the new claim to it, and tell you when they disagree. That is a retrieval task with a comparison layer on top. Canonical RAG, give or take the comparison.
The archive, in numbers
41,000 articles, chapters, and standalone essays, going back to 1971. About 70% sat in XML (JATS-flavoured, mostly clean). The other 30% lived in two earlier formats: a custom SGML dialect from the 1980s, and scanned PDFs with OCR that the editorial desk had been quietly correcting for a decade. Average article length: 4,800 words. Total corpus around 197 million words, or roughly 270 million tokens after our tokenizer.
None of this is large by 2026 standards. It is small enough to embed in an afternoon and fit comfortably in a single Postgres instance. The hard part was not volume. The hard part was that the corpus contradicts itself. The house has published competing positions across more than fifty years, and ground truth is contextual: a claim is true relative to which decade of scholarship you trust.
Architecture, deliberately dull
We resisted the temptation to do anything clever. The stack:
- Postgres 16 with pgvector for the article corpus
- Hybrid retrieval: BM25 running alongside dense embeddings
- Reciprocal rank fusion to merge the two result sets
- A small reranker (bge-reranker-base, self-hosted on a single GPU) for the top 50 candidates
- A frontier LLM for the contradiction-detection step, given the top 8 retrieved passages and the candidate paragraph from the manuscript
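For readers who have not met reciprocal rank fusion, here is a minimal sketch of the merge step. It is an illustration under stated assumptions, not the publisher's code: the function name, the document IDs, and the constant k=60 (a commonly used default) are ours.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one fused ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not a tuned value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is stable.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from the two retrievers.
bm25_hits = ["a17", "a03", "a42"]
dense_hits = ["a03", "a99", "a17"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

A document that both retrievers rank highly rises to the top, and because only ranks are used, the BM25 and cosine scores never need to be put on a common scale.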
Embeddings: we landed on a multilingual model because roughly 22% of the archive is in Dutch, French, or German, and editors expect cross-language retrieval. A 1994 Dutch article on textile trade should surface for an English-language draft on the same period. We tested four embedding models on a hand-labelled set of 200 manuscript paragraphs with known archive matches. The multilingual model won by a wide enough margin that we stopped looking.
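A comparison like this can be scored with a simple hit rate at k: the share of labelled paragraphs whose known archive match appears in the top k results. The sketch below assumes that metric; the article does not say which one was used, and the function name, IDs, and toy data are hypothetical.

```python
def hit_rate_at_k(retrieved, relevant, k=10):
    """Fraction of queries whose top-k results contain a known match.

    retrieved: dict mapping query ID -> ranked list of archive IDs
    relevant:  dict mapping query ID -> set of correct archive IDs
    """
    if not retrieved:
        return 0.0
    hits = sum(
        1
        for query_id, ranked_ids in retrieved.items()
        if set(ranked_ids[:k]) & relevant[query_id]
    )
    return hits / len(retrieved)


# Toy labelled set: three manuscript paragraphs with known archive matches.
labels = {"p1": {"a03"}, "p2": {"a11"}, "p3": {"a20"}}
model_a = {"p1": ["a03", "a07"], "p2": ["a08", "a11"], "p3": ["a01", "a02"]}
model_b = {"p1": ["a07", "a09"], "p2": ["a11", "a08"], "p3": ["a20", "a02"]}

score_a = hit_rate_at_k(model_a, labels, k=2)
score_b = hit_rate_at_k(model_b, labels, k=1)
```

Running the same labelled set through each candidate model gives one number per model, which is enough to see whether one of them is clearly ahead.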
Chunking was less interesting than we expected. We tried paragraph-level, section-level, and sliding-window chunks at 512 and 1,024 tokens. Section-level at 512 tokens with a 64-token overlap won, but only just. The reranker absorbs a lot of chunking sins.
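To make the 512/64 numbers concrete, here is a minimal sketch of overlapping fixed-size chunking applied to the tokens of one section. It assumes the section has already been tokenised; the integer list in the example stands in for real token IDs, and the function name is ours.

```python
def chunk_tokens(tokens, size=512, overlap=64):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once a window has reached the end of the section.
        if start + size >= len(tokens):
            break
    return chunks


# Stand-in for a tokenised 1,000-token section.
section = list(range(1000))
chunks = chunk_tokens(section, size=512, overlap=64)
```

Each chunk starts 448 tokens after the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.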
The contradiction problem
Retrieval gets you the passages that look relevant. It does not get you contradictions. A naive pipeline retrieves the top-K passages, hands them to an LLM with “find disagreements,” and ships whatever comes back. We did exactly that for the first pilot. The false positive rate was 38%.
The model flagged anything that looked tonally different. A 1987 article saying “the prevailing view holds X” and a 2025 draft saying “X is settled” got flagged as contradictory, even though the two are compatible: the draft simply states the same position more firmly. Editors hated it. Within three days they stopped reading the flags.
What worked: a two-stage prompt. The first call extracts atomic claims from the new manuscript, in a normalised form (claim, subject, date, source). The second call, for each retrieved passage, asks a narrower question. Does this passage state a value for the same claim that differs from the new manuscript? Quote both, or return null. Splitting the work cut false positives to 6%. Editors started reading the flags again.
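```python
# `llm`, AtomicClaim, ContradictionVerdict and CONTRADICTION_PROMPT are defined elsewhere.
def detect_contradictions(manuscript_paragraph, retrieved_passages):
    """Return one finding per (claim, passage) pair the model judges contradictory."""
    # Stage 1: extract normalised atomic claims from the manuscript paragraph.
    claims = llm.extract_claims(
        text=manuscript_paragraph,
        schema=AtomicClaim,
    )
    findings = []
    # Stage 2: ask the narrow question once per claim and retrieved passage.
    for claim in claims:
        for passage in retrieved_passages:
            result = llm.compare(
                prompt=CONTRADICTION_PROMPT,
                claim=claim,
                passage=passage,
                response_format=ContradictionVerdict,
            )
            if result.contradicts:
                findings.append({
                    "claim": claim,
                    "archive_passage": passage,
                    "new_quote": result.new_quote,
                    "archive_quote": result.archive_quote,
                    "confidence": result.confidence,
                })
    return findings
```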
The CONTRADICTION_PROMPT is the load-bearing part. It runs about 600 words and includes nine worked examples drawn from the desk's own correction log. We rewrote it eleven times before editors stopped complaining.
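The two schema types named in the function are not shown in this write-up. A minimal sketch of what they could look like follows. The field names come from the text and the function; the types, defaults, the 0-to-1 confidence range, and the validation rule are our assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AtomicClaim:
    """One normalised claim: (claim, subject, date, source)."""
    claim: str
    subject: str
    date: Optional[str] = None    # kept as text; archive dates are often partial
    source: Optional[str] = None  # e.g. the footnote the claim rests on


@dataclass(frozen=True)
class ContradictionVerdict:
    """Result of comparing one claim with one archive passage."""
    contradicts: bool
    new_quote: Optional[str] = None
    archive_quote: Optional[str] = None
    confidence: float = 0.0  # assumed to lie in [0, 1]

    def __post_init__(self):
        # "Quote both, or return null": a positive verdict needs both quotes.
        if self.contradicts and not (self.new_quote and self.archive_quote):
            raise ValueError("a contradiction must quote both sides")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


claim = AtomicClaim(claim="treaty signed", subject="Treaty of Utrecht", date="1713")
no_conflict = ContradictionVerdict(contradicts=False)
```

Rejecting a positive verdict that lacks either quote means a malformed model response fails loudly instead of landing in front of an editor.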
Hitting the 8-second budget
The pilot ran in 47 seconds per manuscript paragraph. The desk's tolerance was somewhere around 10 seconds. Editors will wait if the wait feels like work. They will not wait if it feels like watching a spinner. We needed to fit a single paragraph's check into the time it takes to highlight the next one and scroll.
The wins came from three places. Caching the claim-extraction step per paragraph (the second pass over a revised draft hits the cache for everything that did not change) cut median latency by about 60%. Running the contradiction-comparison calls in parallel against the top 8 retrieved passages, with a small concurrency cap to keep the rate limiter happy, took out another big chunk. And switching the reranker to a quantised on-prem deployment beat the round-trip to a hosted reranker every time.
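The first two wins can be sketched in a few lines of asyncio. This is an illustration, not the production code: the in-memory dict cache, the function names, the stub `extract` and `compare` callables, and the concurrency cap of 2 are all assumptions.

```python
import asyncio
import hashlib

_claim_cache = {}  # paragraph hash -> extracted claims (in-memory for the sketch)


def paragraph_key(text):
    """Hash whitespace-normalised text so unchanged paragraphs hit the cache."""
    normalised = " ".join(text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


async def extract_claims_cached(text, extract):
    """Call the (expensive) extract coroutine only for paragraphs not seen before."""
    key = paragraph_key(text)
    if key not in _claim_cache:
        _claim_cache[key] = await extract(text)
    return _claim_cache[key]


async def compare_all(claim, passages, compare, max_concurrency=4):
    """Run one comparison per passage in parallel, capped by a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(passage):
        async with semaphore:
            return await compare(claim, passage)

    # gather returns results in the same order as the passages.
    return await asyncio.gather(*(bounded(p) for p in passages))


async def _demo():
    calls = []

    async def fake_extract(text):  # stand-in for the LLM extraction call
        calls.append(text)
        return ["claim about " + text.split()[0]]

    async def fake_compare(claim, passage):  # stand-in for the LLM comparison call
        await asyncio.sleep(0)
        return (claim, passage)

    first = await extract_claims_cached("Utrecht  was signed in 1713.", fake_extract)
    second = await extract_claims_cached("Utrecht was signed in 1713.", fake_extract)
    verdicts = await compare_all(first[0], ["p1", "p2", "p3"], fake_compare, max_concurrency=2)
    return len(calls), first == second, verdicts


demo_result = asyncio.run(_demo())
```

The semaphore is what keeps the rate limiter happy: all comparisons are scheduled at once, but only `max_concurrency` of them are in flight at any moment.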
End state: median 8.2 seconds per paragraph, 95th percentile 14 seconds. Good enough.
For domain-specific RAG, the embedding model matters less than you think and the comparison prompt matters more. The expensive thinking happens after retrieval, not during it.
What happened to the six people
Two left during the rollout, for unrelated reasons. The other four are still on the desk. Their job changed.
The agent does the first pass. It reads the manuscript, flags candidate contradictions, attaches the archive quote, and writes a one-line justification. A human reviews every flag before it reaches the author. The agent does not write to the manuscript; it writes to a queue. Editors triage the queue. They reject roughly 6% of flags as wrong, accept 71% as worth raising with the author, and mark the remaining 23% or so as “interesting but not a contradiction,” a category we did not anticipate and had to build a UI for.
The desk now handles roughly three times the volume per person, but the work is different. Less reading-everything, more judgement on edge cases. Two of the four have started writing institutional memory documents from the patterns the agent surfaces. That was not in the brief. It is the best thing we have seen come out of the project.
What we got wrong
Three things, in order of how much they cost.
We underestimated the OCR layer. Roughly 11,000 of the 41,000 articles came from scans. The desk had been correcting OCR errors for years, but the corrections lived in a separate text layer that we missed on the first ingestion. The agent kept flagging contradictions that were really just OCR garbling. We re-ingested with the correction layer merged in and the false positive rate dropped by a third overnight. Lesson: ask twice about how the corpus actually got into its current shape, then ask a third time about the corrections nobody mentions.
We over-trusted the reranker. The reranker is good. It is not infallible. For the first month we showed editors only the top 5 passages per query. They kept asking for “the obvious one” that the reranker had placed at position 12. We now show the top 8 by default and let editors expand to the top 20. Latency cost: minor. Trust cost of not doing it: large.
We shipped without an audit log. Three weeks in, a senior editor asked which version of the agent had flagged a specific contradiction in a manuscript that had since been revised twice. We did not know. We retrofitted full provenance (model version, prompt version, retrieved passages, scores) in week four. If your RAG agent makes editorial decisions, version-pinned audit logs are a day-one requirement, not a refactor.
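A provenance record does not need to be elaborate. The sketch below shows one possible shape, written as one JSON line per flag. The four provenance items come from the list above; the class name, the field names, the manuscript fields, and the prompt fingerprint helper are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def prompt_fingerprint(prompt_text):
    """Short content hash, so a silently edited prompt gets a new version ID."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]


@dataclass
class FlagProvenance:
    manuscript_id: str
    manuscript_revision: int
    model_version: str
    prompt_version: str
    retrieved: list  # (passage_id, score) pairs, in the order shown to the model
    flagged_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_log_line(record):
    """Serialise one record as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


record = FlagProvenance(
    manuscript_id="ms-0042",            # hypothetical identifiers throughout
    manuscript_revision=2,
    model_version="model-2026-01",
    prompt_version=prompt_fingerprint("example prompt text"),
    retrieved=[("a03", 0.91), ("a17", 0.84)],
)
log_line = to_log_line(record)
```

With the manuscript revision and prompt fingerprint stored on every flag, the question "which version of the agent flagged this?" becomes a lookup.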
Where this approach breaks
It works for the Leiden publisher because the archive is contained, well-formed, and trusted. The same approach would fail in three places we can name.
Open-web fact-checking, where the corpus is the world: retrieval becomes a search engine problem, and contradiction detection has to handle bad-faith sources. Different project entirely.
Legal or medical, where a missed contradiction is a deposition: the threshold for human-in-the-loop is much tighter, and the audit requirements push the architecture toward fully traceable rule-based systems with RAG as one input among many.
Any domain where the corpus updates faster than your embedding pipeline can keep up: the publisher's archive grows by 80 titles a year. A news organisation gets that much before lunch. The cost model changes shape.
If you are considering RAG for editorial work, or any domain where a contained archive contradicts itself, the cheapest first step is a one-afternoon experiment. Hand-label thirty paragraphs from a recent manuscript with their known archive precedents, then see whether any off-the-shelf embedding model retrieves the right passage in the top 10. That is the quickest test of whether retrieval is even the right shape for your problem.
When we built the contradiction agent for the Leiden publisher, the comparison prompt mattered more than every other piece combined, and we rewrote it more often than any other part of the system. If you want the same shape for your own corpus, that is the kind of RAG work we do.
Key takeaway
For domain RAG, the comparison prompt does more work than the embedding model. Rewrite it more times than the rest of the system combined.
FAQ
Can a RAG agent replace fact-checkers?
No, and the question misses what fact-checkers actually do. A RAG agent handles first-pass retrieval against your own archive. Humans judge whether a flagged contradiction matters and how to raise it with the author.
How big does an archive need to be for RAG to be worthwhile?
It is not size, it is repetition. If your domain experts run the same retrieval queries every week, even a 1,000-document corpus pays back the build cost within a quarter.
What is the latency ceiling for in-line editorial RAG?
Roughly 8 to 10 seconds per paragraph before editors lose patience. Cache the claim-extraction step, parallelise the comparison calls, and host the reranker yourself to stay inside that window.
Why not just fine-tune an LLM on the archive?
Fine-tuning bakes facts into weights you cannot audit or version per document. RAG keeps the archive separate, so you can update it daily and trace every flag back to a specific source passage.