Case study
Off the tablets: a worker-side RAG for 11 childcare sites
The directeur put the printed Norwegian OK20 recommendation on her desk on a Tuesday in February. By the Friday of the following week, the toddler tablets were off the shelves in all eleven vestigingen.

The directeur of a 24-person kinderopvang-koepel in Breda put the printout of the Norwegian OK20 recommendation on her desk on a Tuesday in February. Eleven locations. Around 380 children under four. And a tablet-based oefen-agent the cooperative had rolled out eighteen months earlier, sitting on a low shelf in every groepsruimte.
The agent did vocabulary games, simple cause-and-effect prompts, and a kijk mee mode where a peuter tapped through illustrated story choices. It was popular with the kids. It was popular with the parents who saw it on the parent-app dashboard. On the new reading from Oslo, it was also exactly the kind of thing you stop giving to a child under four.
The board met the following Monday. By Friday, the tablets were off the shelves. This is the post-mortem of what we put back.
What “agent” meant for a toddler
The old shape was simple, and that was the problem. A peuter sat on the mat, the pedagogisch medewerker was three meters away with another child, and the agent did its thing on the screen. It recognized speech in a kid-pitched model, branched the story on the peuter’s reply, and logged a woordenschat-event to a Postgres table that the koepel’s centrale opleidingscoach reviewed weekly.
It worked, in the narrow sense. The peuters reached more vocabulary events per hour than they did during free play. The coach had numbers she could put in a report. The dashboard was green.
What it did not do was sit inside the pedagogisch contact between the medewerker and the child. The medewerker became, in practice, a referee for the device. Eye contact dropped. The activity stopped being a shared event and became a delivery channel for a vendor’s content tree. Two pedagogisch coaches inside the koepel had flagged this twelve months earlier. The dashboard had been greener than the room.
The Norwegian guidance gave the board a reason to act on what its own staff had already said. We were called in the week after the shelves were cleared.
The new shape: medewerker in the loop, RAG behind her
We rewrote the architecture around a single rule: no model output reaches a child under four directly. Every generated suggestion lands in front of an adult, who decides whether it becomes an activity.
The new system is a planning assistant for the pedagogisch medewerker, not a play partner for the peuter. She opens it on the iPad in the kantoor corner during the dagopening — five minutes before the children arrive — and asks for a spelactiviteit suggestion for that morning’s group. The agent pulls from three sources: the koepel’s own pedagogisch werkplan (a roughly 80-page PDF that has been indexed), the SLO doelen voor het jonge kind scoped by domain (taal, motoriek, sociaal-emotioneel, rekenen), and the vorige-week observatie-notities from that specific group, written by the medewerker in plain Dutch.
It returns three suggestions. Each one cites — inline, with the SLO-code — which doel it advances and why. The medewerker picks one, edits it, or rejects all three and writes her own. The agent does not autorun. It does not push notifications. It has no role once the children are in the room.
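The pick, edit, or reject step can be sketched as a small gate: nothing becomes an activity without an explicit adult decision. This is a minimal illustration, not the koepel's code. The names `Decision`, `PlannedActivity`, and `review` are hypothetical, and the `Suggestion` type here is a simplified stand-in.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ACCEPTED = "accepted"  # medewerker took a suggestion as offered
    EDITED = "edited"      # medewerker changed the text before using it
    REJECTED = "rejected"  # medewerker discarded all three and wrote her own


@dataclass(frozen=True)
class Suggestion:
    text: str
    slo_codes: tuple[str, ...]


@dataclass(frozen=True)
class PlannedActivity:
    text: str
    slo_codes: tuple[str, ...]
    decision: Decision


def review(suggestions, decision, chosen=None, final_text=None):
    """Turn model suggestions into an activity only via an explicit adult decision."""
    if decision is Decision.REJECTED:
        if not final_text:
            raise ValueError("a rejected batch needs the medewerker's own activity text")
        # Her own activity carries no model-generated citation.
        return PlannedActivity(final_text, (), decision)
    if chosen is None or not 0 <= chosen < len(suggestions):
        raise ValueError("accepting or editing requires picking one offered suggestion")
    picked = suggestions[chosen]
    if decision is Decision.EDITED:
        if not final_text:
            raise ValueError("an edit needs the edited text")
        return PlannedActivity(final_text, picked.slo_codes, decision)
    return PlannedActivity(picked.text, picked.slo_codes, decision)
```

There is deliberately no code path that produces a `PlannedActivity` without a `Decision`, which is the architectural rule expressed as a type.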
The AI did not leave the building. It moved from the child’s lap to the worker’s morning prep. That is the shape almost every “AI for kids” product needs.
Citing the SLO doelen, every time
The non-negotiable from the pedagogisch coaches was that every suggestion must trace back to a specific SLO doel. Not “this supports language development” — that is a marketing sentence. We needed something like “this advances SLO-doel T-2.3 (uitbreiding woordenschat met categorienamen) for a 3-jarige in the bovenbouw groep.”
So the RAG layer is doubled. The first retrieval pulls candidate activities from the koepel’s werkplan and historical observaties. The second retrieval, run on every candidate, asks the SLO index: which 0–4 doel does this map to, and at what age band? If the second retrieval returns nothing, the candidate is dropped before the medewerker ever sees it.
The filter is unromantic. Here is the relevant slice:
    # In the RAG layer: drop any candidate activity that
    # cannot be grounded in an SLO doel for the age band.
    def cite_or_drop(candidate: Activity, group: Group) -> Suggestion | None:
        matches = slo_index.search(
            text=candidate.description,
            age_band=group.age_band,      # "0-1", "1-2", "2-3", "3-4"
            domains=group.focus_domains,  # set from the werkplan
            k=3,
        )
        grounded = [m for m in matches if m.score >= 0.62]
        if not grounded:
            return None  # no SLO citation, no suggestion
        return Suggestion(
            activity=candidate,
            citations=[
                Citation(code=m.code, title=m.title, snippet=m.snippet)
                for m in grounded
            ],
        )
The 0.62 threshold is not theoretical. We tuned it on 240 historical activity logs the koepel already had, hand-labeled by one of the pedagogisch coaches over a weekend. Below 0.62, the false-positive rate on SLO mapping crossed 18 percent. Above 0.72, useful suggestions started getting silently dropped. The medewerkers see citations they can defend; we see a hallucination rate they can defend against.
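A threshold sweep of this kind can be sketched in a few lines. This is an illustration under assumptions, not the tuning script itself: `LabeledMatch` and `sweep` are hypothetical names, the "false-positive rate" is taken here to mean the share of kept mappings the coach marked wrong, and the `toy` values are placeholders rather than the koepel's labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledMatch:
    score: float   # similarity score returned by the SLO index
    correct: bool  # coach's label: is this SLO mapping right?


def sweep(matches, thresholds):
    """For each threshold, return (threshold, share of kept mappings that are
    wrong, share of correct mappings that get dropped)."""
    total_correct = sum(m.correct for m in matches)
    rows = []
    for t in thresholds:
        kept = [m for m in matches if m.score >= t]
        wrong_kept = sum(not m.correct for m in kept)
        correct_kept = len(kept) - wrong_kept
        wrong_share = wrong_kept / len(kept) if kept else 0.0
        dropped_share = (
            (total_correct - correct_kept) / total_correct if total_correct else 0.0
        )
        rows.append((t, wrong_share, dropped_share))
    return rows


# Placeholder labels for illustration only.
toy = [
    LabeledMatch(0.55, False),
    LabeledMatch(0.60, True),
    LabeledMatch(0.65, True),
    LabeledMatch(0.70, False),
    LabeledMatch(0.80, True),
]

for t, wrong_share, dropped_share in sweep(toy, [0.50, 0.62, 0.75]):
    print(f"{t:.2f}  wrong among kept: {wrong_share:.0%}  correct dropped: {dropped_share:.0%}")
```

Reading the two columns side by side is what exposes the trade-off: raising the threshold lowers the wrong-mapping share but starts discarding mappings the coach would have approved.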
The 16-week parallel cohort
The board did not want a vibes-based win. The medewerkers did not want to be the test group for a vendor’s enthusiasm. So we ran a 16-week parallel cohort across the eleven vestigingen, starting in March.
Six vestigingen ran on the new worker-side RAG. Five ran on the play-based plan that had been in place before the original tablet-agent rollout: no tablets, no agent, no AI in any form. Same staff training cadence, same werkplan, same parent communication. The koepel’s centrale coach and an external pedagoog from a Tilburg pabo handled the assessment.
The measured outcome was per-kind taalontwikkelings-delta, the change in a child’s score on a standardized Dutch taaltoets between the start and end of the 16 weeks. Not a group average. Not engagement minutes. A delta per child, on the same instrument the GGD recognizes for kwaliteitsmetingen.
The numbers we can share, with the koepel’s permission:
- The RAG-cohort showed a median per-kind delta of +0.41 on the 0–4 schaal; the controle-cohort showed +0.34.
- The variance in the RAG-cohort was lower — the bottom-kwartiel kinderen moved more than they did in the controle-cohort.
- The medewerkers in the RAG-cohort reported spending 11 minutes fewer per dag on activity planning, recovered to direct contact with the children.
The delta is small. We are not claiming a revolution; if we did, we would fail our own brief. What it shows is that you can remove the kid-facing AI without losing the developmental signal, as long as you put the AI somewhere useful instead. The reclaimed planning time is the more honest win.
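For readers who want to reproduce the shape of the measurement, a per-kind delta with a median and quartiles can be computed roughly as follows. This is a sketch under assumptions: the child IDs and scores are made-up placeholders, not cohort data, and `per_child_deltas` and `summarize` are hypothetical helper names. Children without both a start and an end score are skipped here, which is one possible policy and not necessarily the one the assessors used.

```python
from statistics import median, quantiles


def per_child_deltas(start, end):
    """Score change per child; children missing either measurement are skipped."""
    return {cid: end[cid] - start[cid] for cid in start if cid in end}


def summarize(deltas):
    """Median and quartiles of the per-child deltas (needs at least two children)."""
    values = sorted(deltas.values())
    q1, _, q3 = quantiles(values, n=4)  # stdlib default is the 'exclusive' method
    return {"n": len(values), "median": median(values), "q1": q1, "q3": q3}


# Placeholder scores for illustration only.
start_scores = {"kind-a": 1.0, "kind-b": 2.0, "kind-c": 1.5, "kind-d": 2.5, "kind-e": 3.0}
end_scores = {"kind-a": 1.5, "kind-b": 2.25, "kind-c": 2.5, "kind-d": 2.5}

summary = summarize(per_child_deltas(start_scores, end_scores))
print(summary)
```

Reporting the lower quartile next to the median is what makes a claim about the bottom-kwartiel checkable, since a group average would hide it.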
What the GGD actually asks for
Under the Wet kinderopvang, a koepel has to show the GGD inspector that its pedagogisch handelen is grounded in a written werkplan and that activities are traceable. The old tablet-agent made this harder, not easier. The activities were a vendor’s content tree, the link to the koepel’s own werkplan was vague, and the inspector had to take it on faith.
The worker-side RAG produces a per-activity audit row the moment the medewerker accepts a suggestion. Each row stores which suggestion was offered, which SLO-codes it cited, which one the medewerker accepted or what she wrote instead, and the date. When the GGD walked into one of the vestigingen in May, the locatiemanager pulled the last twelve weeks of audit rows into a PDF in under two minutes and walked the inspector through three randomly chosen activities. The inspector’s note used the word navolgbaar, traceable, which is exactly the word the law uses.
One warning, hard-won. The audit table is a legal artifact, not a metrics dashboard. Keep them in the same Postgres schema and you will eventually drop a column for a product reason and break the bewijslast. Separate database, separate retention policy, separate backup.
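A minimal sketch of such an audit store, assuming SQLite from the Python standard library as a stand-in for the separate Postgres database. The table name, column names, and helper functions are hypothetical; only the stored fields (suggestions offered, SLO-codes cited, what the medewerker did, and the date) come from the description above.

```python
import json
import sqlite3
from datetime import date, timedelta

AUDIT_SCHEMA = """
CREATE TABLE IF NOT EXISTS activity_audit (
    id            INTEGER PRIMARY KEY,
    activity_date TEXT NOT NULL,  -- ISO date, so string comparison sorts correctly
    vestiging     TEXT NOT NULL,
    offered       TEXT NOT NULL,  -- JSON list of the suggestions shown
    slo_codes     TEXT NOT NULL,  -- JSON list of the SLO-codes cited
    decision      TEXT NOT NULL
                  CHECK (decision IN ('accepted', 'edited', 'rejected')),
    final_text    TEXT NOT NULL   -- what the medewerker actually planned
)
"""


def open_audit_db(path):
    """Open a connection used only for audit rows, apart from product metrics."""
    conn = sqlite3.connect(path)
    conn.execute(AUDIT_SCHEMA)
    return conn


def record(conn, activity_date, vestiging, offered, slo_codes, decision, final_text):
    """Write one audit row at the moment the medewerker decides."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO activity_audit "
            "(activity_date, vestiging, offered, slo_codes, decision, final_text) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (activity_date.isoformat(), vestiging, json.dumps(offered),
             json.dumps(slo_codes), decision, final_text),
        )


def last_weeks(conn, vestiging, today, weeks=12):
    """Rows for one vestiging over the last `weeks` weeks, oldest first."""
    since = (today - timedelta(weeks=weeks)).isoformat()
    return conn.execute(
        "SELECT activity_date, slo_codes, decision, final_text "
        "FROM activity_audit WHERE vestiging = ? AND activity_date >= ? "
        "ORDER BY activity_date",
        (vestiging, since),
    ).fetchall()
```

Because the audit connection is opened on its own path, a schema change on the metrics side cannot touch these rows. In a Postgres deployment the same idea would map to a separate database with its own credentials, retention policy, and backups.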
The pieces we underestimated
Three things bit us, and they were the parts of the project that had nothing to do with the model.
The werkplan was not really a document. It was a Word file that had been forwarded between locatiemanagers for six years, with conflicting versions on three SharePoint drives. Before the RAG could cite it, we spent eight days with the centrale coach merging it into one canonical PDF the koepel actually signed off on. This was the most useful eight days of the project.
The medewerkers were skeptical for the right reasons. Two had been at the koepel for fifteen years and had watched three software cycles come and go. They were not anti-tech. They were anti-being-evaluated-by-a-dashboard-they-did-not-design. We let them rewrite the suggestion-card layout on paper before we built it. The interface they drew is the interface we shipped.
The Norwegian guidance was not the whole story. The koepel’s parents were split: some had liked the tablet-agent, some had not. The communication to ouders mattered more than the architecture. The board sent a one-page letter, in Dutch, explaining what changed and why. The letter cited the koepel’s own pedagogisch coaches before it cited Oslo. That is the right order.
Reliability is not a model property
The Hacker News thread on building reliable agentic AI systems was on the front page the week we shipped the RAG. The discussion there, like most of the good writing on the topic, kept returning to a single point: reliability is a property of the system around the model, not of the model itself. Anthropic’s “Building effective agents” writeup makes the same case — prefer the smallest composition of well-scoped pieces, and put a human in the loop wherever the cost of being wrong is high.
Getting it wrong with a child under four is the highest such cost we know of. The model in this system is a small one, the retrieval is boring, and the loop closes on an adult with a pedagogische opleiding and, in some cases, fifteen years of experience. The intelligence of the whole thing is in the boring parts.
What you can do this week
When we built the worker-side RAG for the Breda koepel, pulling the kid-facing AI turned out to be the easy part. The hard parts were finding the adult workflow the model could actually help with and writing the citation rule that made the suggestions defensible to a GGD inspector. That kind of AI agent work (narrow, human-in-the-loop, grounded in a real document) is most of what we ship.
If you run a service that touches children, patients, or anyone else who cannot consent to being evaluated by a model: open your product map this afternoon and mark every place a model output reaches the end user directly. For each one, ask whether there is a workflow one step earlier where an adult could intercept it. Most of the time, there is.
Key takeaway
Move the AI from the child’s lap to the worker’s morning prep. That is the right shape for almost every product aimed at someone who cannot consent to being evaluated by a model.
FAQ
Why pull the tablet-agent if the children liked using it?
Engagement was never the question. The pedagogisch coaches had been flagging dropped eye contact and a shift from shared activity to device delivery for a year. The Norwegian guidance gave the board a reason to act on what staff had already said.
What did the GGD inspector actually look at?
Twelve weeks of audit rows, exported to PDF in under two minutes: which suggestion was offered, which SLO-codes it cited, and what the medewerker did with it. The inspector called the trail navolgbaar, the word the Wet kinderopvang uses.
Can a worker-side RAG work without a clean written werkplan?
Not well. We spent eight days merging six years of conflicting Word files into one canonical PDF before the RAG could cite anything. The document work is not optional. It is most of the project.
What model is behind the system?
A small one. The intelligence is in the retrieval, the SLO citation rule, and the human in the loop. Swapping the model out for a competitor would change very little about the behavior the koepel and the GGD care about.
Did the children in the RAG-cohort actually do better on the taaltoets?
Median per-kind delta was +0.41 in the RAG-cohort versus +0.34 in the controle-cohort over 16 weeks, with lower variance. Small absolute difference. The honest win was 11 minutes per day returned to direct contact.