Lab-notebook AI agents: a citation-gated LIMS playbook

A 38-person Eindhoven chemicals SME ships 620 synthesis-plan summaries a week. Here is how we built the agent that drafts them all and cites every CAS lookup.

Jacob Molkenboer · Founder · A Brand New Company · 8 Aug 2025 · 9 min

A Tuesday morning in Eindhoven. The QA chemist opens her review queue and finds 124 synthesis-plan summaries waiting. Each one has to be cross-checked against hazard classifications, reaction conditions, and the lab's internal SOPs. By Friday she will have read more chemistry than a postdoc reads in a month. Her senior colleague spent five hours over the weekend drafting the previous batch by hand. This was the bottleneck we were hired to remove.

The client is a 38-person specialty-chemicals SME running R&D and small-batch production for European pharma. Their LabWare LIMS has been live since 2003. Twenty-two years of schema drift, twenty-two years of in-house plugins, and a small mountain of RTF blobs sitting inside what is technically a relational database. Nothing meaningful leaves that system. Anything we built had to read from it, never around it.

The 620-plan bottleneck

The lab runs about 620 synthesis plans a week across two sites. Roughly 18% are repeats with minor variation; the rest are bespoke. Each plan touches between four and twenty-two substances. Each substance carries a CAS number, a GHS classification, and a stack of reaction-condition flags that a QA chemist must verify against current source data before the plan goes to bench.

The senior chemist who drafts the weekly summaries does this on Saturday morning at his kitchen table. He is excellent. He is also one bad flu away from a six-week delay. The plant manager called us in March 2025 and asked, in so many words, whether an AI could read LabWare and write the summaries. We said yes, with conditions. The conditions are the point of this post.

Architecture in one paragraph

Nightly, a read-only ODBC view pulls the week's synthesis plans into a parquet snapshot on our side. A parser unpacks the RTF and JSON blobs into structured plan records. Every substance referenced gets its CAS number validated by check digit, then looked up against pre-fetched GESTIS and ECHA entries stored as a versioned cache in Postgres. The drafter (an LLM with structured output) is allowed to write a hazard or condition claim only if it attaches a citation ID pointing at one of those pre-fetched entries. Drafts that fail the citation gate never reach the QA queue. Drafts that pass go into a plain HTML approval app where the QA chemist accepts, rejects with a reason, or edits inline. The accepted draft is written back to LabWare via the same ODBC bridge as a structured plan-summary record, with the citation IDs preserved in an audit table.

That is the whole system. Every interesting decision lives inside one of those sentences.

Citations before tokens

The single hardest rule we enforced: the model is not allowed to invent a substance fact. Not "discouraged from", not "penalised for". Structurally not allowed.

The mechanism is unglamorous. Before the drafter sees a synthesis plan, we extract every CAS number in the plan and pre-fetch the matching GESTIS and ECHA records. Those records get a short, stable citation ID, for example gestis:50-00-0:2025-05-01. The model is given a structured-output schema where every hazard sentence requires a cited_from array of those IDs. If a sentence ships without a citation ID, the parser rejects the whole draft and triggers a regeneration with a tighter prompt.
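A minimal sketch of that gate in Python, under stated assumptions: the class and function names here (HazardSentence, CitationGateError, enforce_citation_gate) are hypothetical, and the production schema almost certainly carries more fields. Only the cited_from field comes from the description above. This version also rejects IDs that were never pre-fetched, which is an assumption about how strict the gate is.

```python
from dataclasses import dataclass, field


@dataclass
class HazardSentence:
    """One hazard or condition claim from the drafter's structured output."""
    text: str
    cited_from: list[str] = field(default_factory=list)


class CitationGateError(ValueError):
    """Raised when a draft must be rejected and regenerated."""


def enforce_citation_gate(sentences: list[HazardSentence], known_ids: set[str]) -> None:
    """Reject the whole draft if any sentence is uncited or cites an unknown ID."""
    for index, sentence in enumerate(sentences):
        if not sentence.cited_from:
            raise CitationGateError(f"sentence {index} has no citation")
        # Assumption: an ID that was never pre-fetched is as bad as no ID at all.
        unknown = [cid for cid in sentence.cited_from if cid not in known_ids]
        if unknown:
            raise CitationGateError(f"sentence {index} cites unknown IDs: {unknown}")


# Illustrative ID only; acetone (CAS 67-64-1) is used as a stand-in substance.
prefetched = {"gestis:67-64-1:2025-06-02"}
draft = [HazardSentence("Highly flammable liquid and vapour.", ["gestis:67-64-1:2025-06-02"])]
enforce_citation_gate(draft, prefetched)  # passes silently
```

The gate raises instead of returning a flag, so a caller cannot forget to check the result before forwarding a draft to the queue.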

This is the boring trick that makes the system trustworthy. The chemist does not have to wonder whether the model made up the LD50 figure for formaldehyde. The number is there only because the GESTIS entry it was lifted from is attached to it, and the hover state on her queue shows that GESTIS snippet inline.

Warning

If you let an LLM write hazard text without binding each claim to a pre-fetched source, you have built a hallucination engine with a chemistry vocabulary. Citation-before-token is not a nice-to-have for regulated work. It is the only mode that should ship.

CAS check digits stop a lot of fake numbers

Before any lookup runs, we validate the CAS Registry Number's check digit. A surprising number of bad LLM outputs we caught in early testing were CAS strings that simply did not validate. The check costs almost nothing:

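def validate_cas_checksum(cas: str) -> bool:
    """CAS Registry Numbers carry a check digit. Reject anything that fails."""
    parts = cas.split("-")
    if len(parts) != 3 or not all(p.isascii() and p.isdigit() for p in parts):
        return False
    # Format is 2-7 digits, then 2 digits, then a single check digit.
    if not (2 <= len(parts[0]) <= 7 and len(parts[1]) == 2 and len(parts[2]) == 1):
        return False
    digits = parts[0] + parts[1]
    check = int(parts[2])
    # Weight the digits 1, 2, 3, ... from the right; the sum mod 10 is the check digit.
    total = sum(int(d) * (i + 1) for i, d in enumerate(reversed(digits)))
    return total % 10 == check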


assert validate_cas_checksum("50-00-0")      # formaldehyde
assert not validate_cas_checksum("50-00-9")  # corrupted

Any CAS that fails the checksum gets flagged, the draft is paused, and the plan is sent back to LIMS with a CAS-suspect tag. That happened 14 times in the first eleven weeks. Twice it caught a typo a human had introduced in 2019. The system surfacing legacy data errors was not on the brief, but it is now the feature the QA lead talks about most.
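The flagging step could look roughly like the sketch below. It scans free plan text for CAS-shaped strings and sorts them into valid and suspect buckets using the same check-digit rule. The regular expression, the function names, and the shape of the returned dictionary are all assumptions for illustration; only the CAS-suspect tag name comes from the pipeline described here.

```python
import re

# CAS-shaped candidate: 2-7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")


def cas_checksum_ok(body: str, check: int) -> bool:
    """Same rule as above: weight digits 1, 2, 3, ... from the right, mod 10."""
    total = sum(int(d) * (i + 1) for i, d in enumerate(reversed(body)))
    return total % 10 == check


def triage_cas_numbers(plan_text: str) -> dict[str, list[str]]:
    """Split every CAS-shaped string in the text into valid and suspect buckets."""
    result: dict[str, list[str]] = {"valid": [], "CAS-suspect": []}
    for match in CAS_PATTERN.finditer(plan_text):
        first, second, check = match.groups()
        bucket = "valid" if cas_checksum_ok(first + second, int(check)) else "CAS-suspect"
        if match.group(0) not in result[bucket]:  # keep first occurrence only
            result[bucket].append(match.group(0))
    return result


text = "Charge formaldehyde (50-00-0) in acetone (67-64-1); quench with water 7732-18-6."
print(triage_cas_numbers(text))
# {'valid': ['50-00-0', '67-64-1'], 'CAS-suspect': ['7732-18-6']}
```

The last number is a deliberate one-digit typo of water's CAS number, 7732-18-5, which is the kind of legacy error the check surfaces.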

Reading a 22-year-old LIMS without breaking it

LabWare is a serious piece of software. It has also been customised at this site for two decades. We had three rules going in:

  1. We never write directly to the live LIMS database. Writes go through the supported ODBC bridge with a service account that has audited insert rights to two tables and nothing else.
  2. We never assume the schema. Every nightly snapshot starts with a schema fingerprint check; if a column has been renamed by an internal admin (it has, twice), the pipeline halts and Slack-pings the lab IT contact.
  3. Anything that looks like a plan body in the source data gets parsed three ways (RTF, plaintext, embedded XML) and the parsers vote. Disagreements become flagged plans that QA reviews unaided. About 1.8% of plans land in that bucket.
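Rules 2 and 3 can be sketched in a few lines of Python. This is an illustration under assumptions, not the production code: the table and column names are hypothetical, the fingerprint is a plain SHA-256 over sorted (table, column, type) triples, and the vote accepts a body when at least two of the three parsers agree after whitespace normalisation. A stricter variant could require all three to agree.

```python
import hashlib
from collections import Counter
from typing import Optional


def schema_fingerprint(columns: list[tuple[str, str, str]]) -> str:
    """Hash (table, column, type) triples so that row order does not matter."""
    canonical = "\n".join("|".join(triple) for triple in sorted(columns))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def vote_on_plan_body(candidates: dict[str, Optional[str]]) -> Optional[str]:
    """Return the body at least two parsers agree on, or None to flag the plan."""
    # Collapse whitespace so trivial formatting differences do not split the vote.
    normalised = [" ".join(body.split()) for body in candidates.values() if body]
    if not normalised:
        return None
    body, votes = Counter(normalised).most_common(1)[0]
    return body if votes >= 2 else None


# Hypothetical schema rows; a renamed column changes the fingerprint.
expected = schema_fingerprint([("PLAN", "PLAN_ID", "INTEGER"), ("PLAN", "PLAN_BODY", "BLOB")])
tonight = schema_fingerprint([("PLAN", "PLAN_ID", "INTEGER"), ("PLAN", "BODY_TEXT", "BLOB")])
if tonight != expected:
    print("schema drift detected: halt the pipeline and alert lab IT")
```

Returning None from the vote, rather than a best guess, keeps the flagged plans out of the drafter entirely.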

The temptation with legacy LIMS is to modernise them. Don't. The chemists trust LabWare because it has held their data accurately for twenty-two years. The agent is a layer that reads from it and writes a narrow, well-defined record back into it. That is the whole relationship.

The queue chemists actually use

The QA approval app is plain server-rendered HTML. No SPA. No client routing. It loads in about 180ms on the lab's Dell desktops, which still run Windows 10 and will until next year's refresh. One key approves. One key rejects, with a required free-text reason that feeds back into the retrieval set. The citation IDs render as small inline chips; hovering a chip opens the actual GESTIS or ECHA snippet that justified the sentence.

We built three prototypes before this one. Two of them were React. The chemists hated both, for the same reason: anything that flashed, animated, or required a second click broke their flow. The HTML version, served from a small Flask app behind their VPN, won inside one afternoon of usability testing.

Source rot and the citation cache

ECHA classifications change. GESTIS updates. A hazard statement that was right in March is sometimes wrong by August. The naive design caches a substance entry once and forgets. We did not do that.

The citation cache in Postgres is partitioned by source and snapshot date. Each partition holds the GESTIS or ECHA entries fetched on a given day. When a new snapshot of a substance arrives, it goes into the current partition; when a partition ages out of the 18-month retention window, we drop it whole. A thread that did the rounds on Hacker News last week put the pattern bluntly: the only scalable delete in Postgres is DROP TABLE. For a cache that takes millions of small rows, partition-and-drop is what keeps storage flat and index bloat from killing query times. The same principle applies to embedding tables in any RAG system worth its bandwidth.
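As a rough sketch of the retention job, the Python below works out which partitions have aged past the 18-month window and emits one DROP TABLE statement per partition. The partition naming scheme (citation_cache_source_YYYY_MM_DD) and the month-aligned cutoff are assumptions made for illustration; the statements would be run by whatever migration or scheduler tooling is already in place.

```python
from datetime import date


def retention_cutoff(today: date, retention_months: int = 18) -> date:
    """First day of the month that lies `retention_months` before `today`."""
    months = today.year * 12 + (today.month - 1) - retention_months
    return date(months // 12, months % 12 + 1, 1)


def partition_name(source: str, snapshot: date) -> str:
    """Hypothetical naming scheme: one partition per source and snapshot day."""
    return f"citation_cache_{source}_{snapshot:%Y_%m_%d}"


def drop_statements(partitions: dict[str, date], today: date, retention_months: int = 18) -> list[str]:
    """Emit DROP TABLE for every partition older than the retention cutoff."""
    cutoff = retention_cutoff(today, retention_months)
    expired = sorted(name for name, snapshot in partitions.items() if snapshot < cutoff)
    return [f"DROP TABLE IF EXISTS {name};" for name in expired]


snapshots = [("gestis", date(2024, 1, 15)), ("echa", date(2024, 2, 1)), ("gestis", date(2025, 5, 1))]
existing = {partition_name(source, day): day for source, day in snapshots}
print(drop_statements(existing, today=date(2025, 8, 8)))
# ['DROP TABLE IF EXISTS citation_cache_gestis_2024_01_15;']
```

Partition names here are built only from a fixed source label and a date, never from user input, which is what makes string-formatted DDL acceptable in this sketch.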

Every citation ID embeds the snapshot date, so if a QA chemist opens an approved draft from six months ago, the hover state shows the GESTIS snippet as it stood on the day the draft was written, not today's version. That is what auditors want to see. It is also the cheapest insurance policy we have ever sold.
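To make the lookup concrete, here is a small sketch that splits a citation ID of the form source:CAS:date back into its parts, so the hover state can fetch the snippet from the matching snapshot. The CitationId class and parse_citation_id function are hypothetical names, and the echa: prefix is an assumption that mirrors the gestis: format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CitationId:
    source: str     # "gestis" or "echa" (the "echa" prefix is assumed)
    cas: str        # CAS Registry Number as written in the ID
    snapshot: date  # day the cached entry was fetched


def parse_citation_id(raw: str) -> CitationId:
    """Split 'source:cas:YYYY-MM-DD'; raises ValueError on any malformed ID."""
    source, cas, snapshot = raw.split(":")
    if source not in {"gestis", "echa"}:
        raise ValueError(f"unknown citation source: {source!r}")
    return CitationId(source, cas, date.fromisoformat(snapshot))


cid = parse_citation_id("echa:67-64-1:2025-02-14")
print(cid.source, cid.cas, cid.snapshot.isoformat())
# echa 67-64-1 2025-02-14
```

Because the date travels inside the ID, the audit lookup needs no extra join to find out which version of the source the chemist approved.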

What broke first, and what we measured

Three things broke in the first six weeks of production.

Stale snapshots silently feeding the drafter. The ODBC bridge crashed on a Friday afternoon. The agent kept generating drafts against Thursday's data until Monday. Fix: a heartbeat that blocks generation if the snapshot is older than 12 hours, and a Slack alert that escalates after the second missed cycle.
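The heartbeat amounts to one comparison. A minimal sketch, assuming the snapshot job records a timezone-aware completion time somewhere the drafter can read it; the function and constant names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

MAX_SNAPSHOT_AGE = timedelta(hours=12)


def generation_allowed(snapshot_taken_at: datetime, now: datetime,
                       max_age: timedelta = MAX_SNAPSHOT_AGE) -> bool:
    """Block drafting when the newest snapshot is older than the allowed age."""
    return now - snapshot_taken_at <= max_age


now = datetime(2025, 6, 9, 8, 0, tzinfo=timezone.utc)          # Monday morning
thursday = datetime(2025, 6, 5, 23, 30, tzinfo=timezone.utc)   # last good snapshot
print(generation_allowed(thursday, now))
# False
```

Passing the current time in as an argument, instead of reading the clock inside the function, keeps the check easy to test.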

Drafts that were technically correct but wrong by house style. The lab has internal naming conventions; some substances have nicknames older than the chemists who use them. We added a glossary document to the retrieval set, and the drafter now matches house terminology in about 96% of drafts.

QA fatigue on near-identical repeat plans. 18% of plans are minor variations on the same family. We added a diff-against-last-accepted view so the chemist sees only what changed since the prior week. Review time on repeats dropped by roughly 70%.
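One way to build such a view is Python's standard difflib module. The sketch below is an assumption about the approach, not the production code: it compares the last accepted summary with the new draft line by line and returns only the removed and added lines.

```python
import difflib


def changed_lines(last_accepted: str, current: str) -> list[str]:
    """Return only removed ('-') and added ('+') lines between two summaries."""
    diff = list(difflib.unified_diff(
        last_accepted.splitlines(),
        current.splitlines(),
        fromfile="last_accepted",
        tofile="current",
        lineterm="",
        n=0,  # no context lines: the chemist sees only what changed
    ))
    # The first two entries are the '---' and '+++' file headers; skip them,
    # then drop the '@@' hunk markers by keeping only '+' and '-' lines.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


previous = "Solvent: toluene\nTemperature: 80 C\nCatalyst: Pd/C"
this_week = "Solvent: toluene\nTemperature: 95 C\nCatalyst: Pd/C"
print(changed_lines(previous, this_week))
# ['-Temperature: 80 C', '+Temperature: 95 C']
```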

Eleven weeks in, the senior chemist's drafting time went from five hours a week to about 35 minutes of review. The QA rejection rate on agent drafts sits at 4.1%, against a 6.2% baseline we measured on the prior human-only workflow. We sampled 1,100 hazard sentences from production drafts for citation accuracy. Three were wrong, about 0.3%: one stale snapshot, one CAS collision between two substances with similar trade names, one parser bug we shipped a fix for in week eight. Zero hallucinated hazard claims have reached production, because the citation gate makes that failure mode structurally impossible: a sentence without a citation ID never reaches the QA queue.

The thing to do today

If you are looking at a similar workload (regulated text drafted against a legacy system of record), the smallest useful thing you can do this afternoon is list every external fact the draft has to be correct about and ask, for each fact, where the source lives and how you would attach a stable citation ID to it before the model writes a word. If you cannot answer that for half the facts, the agent is not the next step. The retrieval set is.

When we built this lab-notebook agent for the Eindhoven site, the hardest design decision was not the model or the prompt. It was deciding to fail loudly when a citation was missing rather than ship a draft that read well. If you want help thinking through a similar build, the way we approach AI agents for regulated workloads starts from that constraint and works backwards.

Key takeaway

If a regulated draft cannot cite the source for every external fact before the model writes a word, you have built a hallucination engine, not an agent.

FAQ

Can this kind of agent work without a LIMS at all?

Yes, but you need some structured system of record for the plans. A shared folder of Word documents is workable; a single Excel file is not. The agent needs something queryable to attach citations to.

What happens if a CAS number isn't in GESTIS or ECHA?

The draft pauses on that substance and is routed to QA without a generated summary. The chemist can attach an internal SDS, which becomes a new citable source in the retrieval set for future runs.

How long did the build take from kickoff to production?

Nine weeks. Three of those were spent parsing legacy RTF blobs out of LabWare. The model and prompt work was less than two weeks of real effort.

Why server-rendered HTML instead of a modern frontend?

Lab desktops are slow, chemists work in short focused bursts, and any animation or routing delay broke their flow. Plain HTML loaded in 180ms; the React prototypes did not.

Tags: ai agents, rag, knowledge base, process automation, integrations, case study
