Case study

Elementary school AI after OK20: a Leiden bijles rebuild

A Leiden school umbrella pulled its bijles (tutoring) agent off four campuses within ten days of Norway's OK20 decision. We rebuilt it as a docent-in-the-loop RAG system, with a teacher approving every suggestion. Here is what changed.

Jacob Molkenboer · Founder · A Brand New Company · 21 Jun 2026 · 10 min


It is a Tuesday in February. Groep 7 at one of the four vestigingen in Leiden-Zuid is working on breukrekenen (fractions). The Snappet-bijles agent has just pushed an oefenset (exercise set) to a ten-year-old's tablet, one that the SLO-leerlijn places solidly in groep 8. The juf swipes it away and writes the moment down in a notebook she has started keeping. By the end of the week the notebook has eleven entries. By the end of the month it has fifty-two.

The next Monday, Norway's OK20 decision lands: a near-ban on autonomous AI tutoring in barneskolen, the elementary tier. The koepel's bestuur meets that Friday morning. By the following Wednesday the Snappet-bijles agent is switched off across all four vestigingen.

This is what we built to replace it, and what fourteen weeks of running both systems in parallel actually showed.

Why the koepel pulled, not paused

The koepel employs nineteen people across four campuses and teaches roughly 720 leerlingen. They did not pause the agent. They pulled it. Two reasons.

First, the Inspectie van het Onderwijs had already started asking pointed questions about differentiation evidence at one of the vestigingen. Under the Wet primair onderwijs, schools have to be able to show how they adapt instruction per leerling. The honest answer at the time was "the algorithm picks." That answer does not survive an inspection visit, and the koepel's bestuur knew it.

Second, the docenten could not explain what the agent was doing. Eleven swipes in the notebook in one week is eleven moments where a teacher overrode the system without being able to point at why the system suggested what it suggested. Norway's decision was the trigger; the local case had been building for months.

So the question we got, the Thursday after the bestuursvergadering, was not "can you ship a safer agent." It was "can you ship an agent the docenten understand and the Inspectie can audit, by the start of the second half of the school year."

What the new system actually had to do

Three constraints, all non-negotiable.

Every oefenset suggested to a leerling under twelve had to cite a concrete leerdoel from the SLO-leerlijn for that leerling's leerjaar. Not a paraphrase. The literal kerndoel text, with its identifier and the URL of the document it came from. If no leerdoel fit, the agent refused to suggest anything.

A docent had to approve the suggestion before the leerling saw it. Docent-in-the-loop, not docent-on-top. A "review queue you can clear later" was not docent-in-the-loop; that pattern was exactly what the koepel had just walked away from. The leerling sees nothing until a human has signed off.

Every decision — suggestion, approval, rejection, docent override, outcome — had to be logged in a format the Inspectie could read without a translator. Not screenshots. Structured records, queryable, with the model version stamped on each row.

Generation quality came fourth. The point of this build was not better answers. It was defensible answers.

The architecture, in one drawing

A leerling submits an opdracht and gets it wrong. A small classifier tags the failure pattern — for example, noemers_niet_gelijk or kommagetal_verschoven. That pattern, plus the leerling's current leerjaar, goes into the retrieval layer.

Retrieval hits the SLO-leerlijn first, not the oefenset library. The leerlijn is the canon; the exercises follow from the leerdoel, never the other way around. The top leerdoel — filtered hard to the leerling's leerjaar and below — becomes the citation that everything downstream has to defend.

Only then does the system pick an oefenset that maps to that leerdoel. The suggestion enters the docent's queue with the literal kerndoel text attached. The leerling sees nothing until the docent approves.

def suggest_oefenset(leerling_id: str, opdracht_id: str) -> Suggestion:
    leerling = pupils.get(leerling_id)
    mistakes = recent_mistakes(leerling_id, window_days=14)
    topic    = topic_of(opdracht_id)

    # Retrieval over the SLO-leerlijn, NOT the exercise library.
    leerdoelen = slo_index.search(
        topic=topic,
        mistake_tags=mistakes.tags,
        k=8,
    )

    # Hard ceiling at the leerling's current leerjaar.
    leerdoelen = [d for d in leerdoelen if d.leerjaar <= leerling.leerjaar]
    if not leerdoelen:
        return Suggestion.skip(reason="geen_passend_leerdoel")

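    target   = leerdoelen[0]
    oefenset = exercise_db.match(
        leerdoel_id=target.id,
        mistake_pattern=mistakes.primary,
    )
    # Assumption: match() returns None when no oefenset maps to this leerdoel.
    # Skip in that case rather than fail on oefenset.id below.
    if oefenset is None:
        return Suggestion.skip(reason="geen_passende_oefenset")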

    return Suggestion(
        leerling_id=leerling_id,
        oefenset_id=oefenset.id,
        leerdoel_id=target.id,
        leerdoel_citation=target.source_quote,   # literal SLO text
        leerdoel_url=target.source_url,
        model_version=MODEL_VERSION,
        status="awaiting_docent_review",
    )

The leerjaar ceiling, the filter applied right after retrieval, is the line the old Snappet-bijles agent kept crossing. The new system refuses to suggest above-leerjaar work even when a leerling's recent performance would justify it pedagogically. If a docent thinks a leerling is ready to move up, that decision happens explicitly, in the docent queue, with the docent's name on the override.
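The approval gate and the explicit override can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not the koepel's code: `QueuedSuggestion`, `approve`, `override_leerjaar`, `visible_to_leerling` and the two approved status strings are hypothetical names; only the status `awaiting_docent_review` comes from the function above.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class QueuedSuggestion:
    leerling_id: str
    leerling_leerjaar: int
    leerdoel_leerjaar: int
    status: str = "awaiting_docent_review"
    docent_id: Optional[str] = None
    override_reason: Optional[str] = None


def approve(s: QueuedSuggestion, docent_id: str) -> QueuedSuggestion:
    """Normal sign-off: only allowed at or below the leerling's leerjaar."""
    if s.leerdoel_leerjaar > s.leerling_leerjaar:
        raise ValueError("above-leerjaar work needs an explicit override")
    return replace(s, status="approved", docent_id=docent_id)


def override_leerjaar(
    s: QueuedSuggestion, docent_id: str, new_leerdoel_leerjaar: int, reason: str
) -> QueuedSuggestion:
    """Docent moves the leerling up; the name and reason travel with the record."""
    if not reason.strip():
        raise ValueError("an override needs a reason")
    return replace(
        s,
        status="approved_with_override",
        leerdoel_leerjaar=new_leerdoel_leerjaar,
        docent_id=docent_id,
        override_reason=reason,
    )


def visible_to_leerling(s: QueuedSuggestion) -> bool:
    """The leerling sees nothing until a named docent has signed off."""
    approved = s.status in {"approved", "approved_with_override"}
    return approved and s.docent_id is not None
```

The point of keeping the records immutable is that every state change produces a new record, which maps naturally onto one audit row per event.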

The re-ranker that scores candidate leerdoelen against the opdracht text is a small bi-encoder fine-tuned on three years of the koepel's own oefensets, not a frontier model. A larger model did not help; the bottleneck was domain vocabulary, not reasoning. Inference per suggestion costs effectively nothing, which matters when a single vestiging produces two thousand suggestions in a busy week.

Takeaway

Log the citation, not just the answer. An audit trail without the source the model used is useless to the Inspectie, and useless to you the next time a teacher disagrees with the system.

The fourteen-week parallel cohort

From mid-February to the end of May, two cohorts at two of the four vestigingen ran in parallel. Cohort A — 184 leerlingen — kept the old autonomous Snappet-bijles agent. Cohort B — 178 leerlingen — ran on the new RAG with docent-in-the-loop. Same opening rekentoets, same closing rekentoets, same lesson plan template, same six docenten rotating between the two cohorts every three weeks. The rotation was uncomfortable to manage and unavoidable: without it, any delta between cohorts would be confounded by which teacher you happened to draw.

The result we expected was a small loss on raw rekentoets-delta for cohort B, in exchange for tractable Inspectie-bewijslast and lower variance. Roughly that is what we got.

The mean rekentoets-delta for cohort A was about a point higher than cohort B over the fourteen weeks. The standard deviation in cohort B was meaningfully narrower: the agent never suggested an oefenset above-leerjaar to a leerling who was already behind, so the long tail of frustrated children that the old system tended to grow did not develop. Two leerlingen in cohort A had to be moved into bijwerken at the end of the period; cohort B had none.
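For readers who want to run the same comparison on their own cohorts, the two quantities discussed here (mean rekentoets-delta and its spread) can be computed per cohort as below. This is a sketch under assumptions: `summarise` and `CohortSummary` are hypothetical names, scores are assumed to be paired per leerling in the same order, and the spread is the sample standard deviation.

```python
from statistics import mean, stdev
from typing import NamedTuple, Sequence


class CohortSummary(NamedTuple):
    n: int
    mean_delta: float
    sd_delta: float


def summarise(opening: Sequence[float], closing: Sequence[float]) -> CohortSummary:
    """Per-leerling delta = closing score minus opening score."""
    if len(opening) != len(closing):
        raise ValueError("opening and closing scores must be paired per leerling")
    if len(opening) < 2:
        raise ValueError("need at least two leerlingen for a standard deviation")
    deltas = [c - o for o, c in zip(opening, closing)]
    return CohortSummary(n=len(deltas), mean_delta=mean(deltas), sd_delta=stdev(deltas))
```

Comparing `mean_delta` between cohorts gives the progress trade-off; comparing `sd_delta` shows whether the long tail has narrowed.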

None of that is a knock-down case for one cohort over the other. It is a case that the trade-off the koepel actually wanted — modest progress floor, no ceiling-overshoot, a defensible per-leerling record — was achievable without giving up the underlying technique. The bestuur signed off on rolling cohort B's system out to all four vestigingen from September.

The Inspectie file

The unglamorous part of this build was the audit log. It is also the part that paid for the rest.

Every suggestion, every approval, every rejection, every docent override, every leerling outcome is one row in a Postgres table that the koepel's bestuur exports as Parquet at the end of each schoolweek. The core columns are boring: timestamp, leerling-id (pseudonymised), docent-id, leerdoel-id, leerdoel-citation, oefenset-id, model-version, decision, decision-reason. The schema has not changed since week three.

Sixteen columns in total: a timestamp; identifiers for the event, school, vestiging, pseudonymised leerling, docent, suggestion, leerdoel, and oefenset; the literal citation text; the source URL; the model version string; the event type; and three columns for the decision and its reason code. One row per event, one Parquet file per schoolweek. An average week comes in under nine megabytes. A full schooljaar of audit evidence fits on a USB stick.
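A minimal sketch of one audit row, restricted to the nine columns named explicitly above. Everything beyond those column names is an assumption for illustration: the `AuditRow` class, the keyed-hash `pseudonymise` helper (the article does not say how pseudonymisation is done), and `schoolweek_key`, which approximates "one file per schoolweek" with ISO week numbers.

```python
import hashlib
import hmac
from dataclasses import dataclass
from datetime import datetime


def pseudonymise(leerling_id: str, secret: bytes) -> str:
    """Keyed hash (assumed scheme): the same leerling maps to the same opaque id."""
    digest = hmac.new(secret, leerling_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


@dataclass(frozen=True)
class AuditRow:
    timestamp: datetime
    leerling_id: str        # already pseudonymised
    docent_id: str
    leerdoel_id: str
    leerdoel_citation: str  # literal SLO text
    oefenset_id: str
    model_version: str
    decision: str
    decision_reason: str

    def __post_init__(self) -> None:
        # A row without its source is exactly the audit gap to avoid.
        if not self.leerdoel_citation.strip():
            raise ValueError("audit row without a leerdoel citation")
        if self.timestamp.tzinfo is None:
            raise ValueError("timestamp must be timezone-aware")


def schoolweek_key(row: AuditRow) -> str:
    """Bucket rows for a weekly export; ISO weeks stand in for schoolweeks."""
    iso = row.timestamp.isocalendar()
    return f"{iso[0]}-W{iso[1]:02d}"
```

Rejecting a row at construction time, rather than at export time, means a missing citation surfaces as a bug the day it happens instead of during an inspection.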

When the Inspectie visited in May, the koepel handed over a query the inspecteur could run themselves: "give me every leerling whose suggested work was above leerjaar in the last twelve weeks." The answer was zero rows. Then: "give me every override the docenten made, grouped by docent and reason." That answer had three hundred-odd rows and a clean distribution across teachers. The inspection wrapped by lunch.
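The two inspection questions translate into short SQL queries. The sketch below uses Python's built-in `sqlite3` as a stand-in for the Postgres table so that it runs anywhere. The table name `audit_log`, the two leerjaar columns, and the decision values `'suggested'` and `'override'` are assumptions for illustration, not the koepel's actual schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE audit_log (
    ts                TEXT NOT NULL,
    leerling_id       TEXT NOT NULL,
    leerling_leerjaar INTEGER NOT NULL,
    leerdoel_leerjaar INTEGER NOT NULL,
    docent_id         TEXT,
    decision          TEXT NOT NULL,
    decision_reason   TEXT
)
"""

# "Every leerling whose suggested work was above leerjaar" since a cutoff date.
ABOVE_LEERJAAR = """
SELECT DISTINCT leerling_id
FROM audit_log
WHERE decision = 'suggested'
  AND leerdoel_leerjaar > leerling_leerjaar
  AND ts >= :since
"""

# "Every override the docenten made, grouped by docent and reason."
OVERRIDES = """
SELECT docent_id, decision_reason, COUNT(*) AS n
FROM audit_log
WHERE decision = 'override'
GROUP BY docent_id, decision_reason
ORDER BY docent_id, decision_reason
"""


def run_inspectie_queries(conn: sqlite3.Connection, since: str):
    """Return (leerling ids suggested above leerjaar, override counts)."""
    above = [row[0] for row in conn.execute(ABOVE_LEERJAAR, {"since": since})]
    overrides = conn.execute(OVERRIDES).fetchall()
    return above, overrides
```

Timestamps are stored as ISO-8601 strings in one fixed format here, so the `ts >= :since` comparison works as plain string ordering; in Postgres a `timestamptz` column would do the same job.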

What broke, what we changed

Three things did not go to plan.

The first week, the docent queue overflowed. The agent generated more candidate oefensets than the docenten could review during lesson hours; suggestions piled up overnight and the docenten arrived to a list they could not clear before the eerste pauze. We added a batch-approval mode for low-risk suggestions: same leerdoel as the previous approved suggestion for that leerling, identical oefenset structure, no new mistake pattern. The docent approves the batch with a single tap and a sample-check. Median time from suggestion to decision dropped from 47 minutes to 11.
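The low-risk rule has three conditions, which makes it easy to express as a single predicate. A minimal sketch, assuming a hypothetical `Candidate` record in which `oefenset_structure` is some fingerprint of the exercise layout and "no new mistake pattern" means the new patterns are a subset of those already seen:

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Candidate:
    leerling_id: str
    leerdoel_id: str
    oefenset_structure: str           # hypothetical fingerprint of the layout
    mistake_patterns: FrozenSet[str]  # e.g. {"noemers_niet_gelijk"}


def is_low_risk(new: Candidate, last_approved: Optional[Candidate]) -> bool:
    """True only if the suggestion may join a batch for one-tap approval."""
    if last_approved is None or new.leerling_id != last_approved.leerling_id:
        return False  # nothing approved yet for this leerling: review individually
    return (
        new.leerdoel_id == last_approved.leerdoel_id
        and new.oefenset_structure == last_approved.oefenset_structure
        and new.mistake_patterns <= last_approved.mistake_patterns
    )
```

Anything that fails the predicate stays in the normal queue, so the batch path can only ever shorten review of work a docent has effectively approved before.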

The second was retrieval drift. Some of the SLO-leerlijn topic descriptions are loose enough that the retriever would occasionally surface a near-match leerdoel from a different sub-domain: say, a meetkundig kerndoel for a leerling who had stumbled on a verhoudings-opdracht. We added a re-ranker that scored each candidate leerdoel against the actual opdracht text, not just the topic tag, and raised the minimum re-ranker score for making a suggestion from 0.6 to 0.7; below that score the agent suggests nothing. The agent now skips more often. The docenten prefer it.
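The skip-below-threshold behaviour can be sketched in a few lines. This assumes the re-ranker score is the cosine similarity between an embedding of the opdracht text and an embedding of each candidate leerdoel, which is a common way to use a bi-encoder but is not confirmed by the build described here; `cosine` and `pick_leerdoel` are hypothetical names, and the embeddings are taken as given.

```python
import math
from typing import Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pick_leerdoel(
    opdracht_vec: Vector,
    candidates: Sequence[Tuple[str, Vector]],
    threshold: float = 0.7,
) -> Optional[str]:
    """Return the best-scoring leerdoel id, or None to signal 'no suggestion'."""
    if not candidates:
        return None
    scored = [(cosine(opdracht_vec, vec), leerdoel_id) for leerdoel_id, vec in candidates]
    best_score, best_id = max(scored)
    return best_id if best_score >= threshold else None
```

Returning `None` instead of the least-bad candidate is the design choice that matters: a near-match from the wrong sub-domain never reaches the docent queue.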

The third was the docenten themselves. Two of the six rotating docenten said, three weeks in, that they trusted the new system less than the old one — precisely because they could now see what it was choosing. We took that seriously. The fix was redesigning the docent queue UI to lead with the leerdoel-citation, not the oefenset preview. When the docent sees the curriculum line first, the suggestion reads as a curriculum decision the agent is sourcing, not as the agent picking and asking for permission. Same data, different framing, completely different trust posture.

We checked again four weeks later. The two docenten who had flagged distrust were approving at the same rate as their peers and overriding less often. Putting the leerdoel first did not just rearrange the queue UI; it changed who in the room felt accountable for the pedagogical decision. That accountability belonged with the docent the whole time. The agent had been quietly carrying it, which is the part nobody had a way to name until they saw it gone.

The smallest thing you could do today

If you run an AI agent in any regulated context — schools, healthcare, finance, public sector, anywhere with an inspecteur — pick one decision your system made today and ask: can you point at the source it used? Not the prompt. The source. If the answer is "the model just knew," you have an audit problem waiting to land. Add the citation column to your log before you add the next feature.

When we built the docent-in-the-loop RAG for the Leiden koepel, the hard part was not the retrieval and not the model. It was the audit schema and the queue UI. We ship AI agents in regulated contexts mostly the same way each time now: cite first, generate second, log everything, and put a human on the approval seat with a citation-first view.

Key takeaway

In regulated AI, log the source the model used, not just the answer. The audit trail you build today is the only one you can defend tomorrow.

FAQ

Did the new system slow the pupils' progress?

Cohort B's mean rekentoets-delta was about a point lower than cohort A's over fourteen weeks, but its variance was narrower and none of its leerlingen had to be moved into bijwerken, against two in cohort A.

Why cite the SLO-leerlijn specifically?

It is the canonical curriculum document Dutch primary schools are measured against. Citing it gives the Inspectie a direct line from any suggestion back to a national standard.

How long does docent-in-the-loop approval add per suggestion?

After the batch-approval mode shipped, median time from agent suggestion to docent decision was 11 minutes. Before it, the queue regularly carried over to the next day.

Could you do this without a RAG layer?

Yes, but you would lose the citation. The point is not generation quality. It is traceability: a defensible source for every output the leerling and the Inspectie ever see.

ai agents · rag · case study · knowledge base · operations · architecture
