Voice agents in home care: shipping a Wkkgz-safe line

Building a voice agent that takes 220 medication callbacks a day under Wkkgz, and lands red flags inside the wijkverpleegkundige's Nedap Ons session before she moves on to her next visit.

Jacob Molkenboer · Founder · A Brand New Company · 25 Mar 2025 · 9 min

The 7:15 call

It is 7:15 on a Tuesday. The wijkverpleegkundige on call sits at a client's kitchen table, tablet open in Nedap Ons, ticking off the morning visit. Her work phone vibrates. Three streets over, the voice agent has just run a medication check. The client paused for nine seconds when asked about her insulin, then said, "Ik denk dat ik 'm al heb gehad." A handoff card appears inside the Ons session the nurse already has open. She does not switch apps. She does not open a paper sheet. She taps through to the client record and calls back.

That moment, the inside-the-Ons-session handoff, is the entire reason this project shipped. Everything else (the model, the prompts, the call routing, the Dutch ASR) was solvable. The hard problem was that a nurse on a moped between visits cannot be asked to remember three apps.

We built this voice agent for a mid-sized Dutch thuiszorg provider that runs about 220 medication callbacks a day across two regions. The agent calls the client, asks four questions, listens, and either marks the call clean or escalates a red flag to a named nurse on call. Everything sits under Wkkgz duty-of-care, which means the audit trail has to survive an IGJ visit and any escalation has to land with a human inside the working minute.

What follows is the playbook, in the order we built it.

The Wkkgz constraint

Wkkgz (Wet kwaliteit, klachten en geschillen zorg) is not a checklist. It is a duty to deliver "goede zorg," and it puts the named caregiver on the hook for what any system does on their behalf. For a voice agent, three things matter.

First, the agent cannot give clinical advice. It can ask, listen, log, and escalate. The moment it nudges toward a clinical decision (take it now, skip the dose, double up) you have crossed a line the inspectorate will draw at audit time. We wrote this into the system prompt as a hard refusal, and into the post-hoc classifier as an alarm.

Second, every interaction needs a human-readable trail in a system the provider already runs. Not a parallel dashboard. Not a CSV in S3. The nurse's existing client record. For this provider that meant Nedap Ons, the dominant ECD in Dutch home care.

Third, an escalation is not a notification. It is a named handoff. "Someone should look at this" is not Wkkgz-safe. "Marieke is on call until 10:00, this is in her queue, here is the transcript" is.

If your voice agent can fail silently into a void, it is not a healthcare voice agent. It is a liability. Build the escalation path before you build the model.

Intent model, not script tree

The first version we scoped was a decision tree. Question one branches into A or B, question two into C or D, the agent walks the tree. We threw it out after two days of pilot calls.

Elderly clients do not answer in branches. They answer in stories. "Nou, de pillen lagen op het aanrecht, maar mijn dochter kwam langs en ik denk dat ze ze heeft verplaatst, en toen had ik mijn koffie, en ja de oranje heb ik volgens mij wel gehad." A tree cannot eat that. An intent model can.

We landed on five intents the agent classifies after each turn:

confirmed: the client clearly took the medication on schedule
refused: the client explicitly chose not to take it
uncertain: the client does not remember, or contradicts herself
side-effect: the client mentions a symptom (nausea, dizziness, a fall)
distress: shortness of breath, confusion, pain, "ik voel me niet goed"

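Confirmed and refused normally close the call; the exception, per the red-flag taxonomy below, is a refusal with no explanation, which escalates. The other three always escalate, with different urgency.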

type Intent =
  | "confirmed"
  | "refused"
  | "uncertain"
  | "side-effect"
  | "distress";

interface TurnResult {
  intent: Intent;
  confidence: number;        // 0..1
  evidence: string;          // verbatim quote from transcript
  red_flags: string[];       // e.g. ["mentions_fall", "slurred_speech"]
  escalate: boolean;
  urgency: "none" | "minutes" | "now";
}

The model returns this after every patient turn, not just at the end. If urgency hits "now" mid-call, the agent stops the script, says a single calm line ("Ik laat een verpleegkundige u zo terugbellen, blijft u alstublieft aan de lijn als dat lukt"), and fires the handoff. It does not try to finish the questionnaire.
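The per-turn rule can be sketched as a small pure function. This is an illustration under assumptions, not the production code: `nextStep`, `Step`, and `TurnSignal` are hypothetical names, and the choice to queue non-immediate escalations while the script continues is an assumption the article does not confirm.

```typescript
type Urgency = "none" | "minutes" | "now";

// Trimmed-down view of TurnResult: only the fields this rule needs.
interface TurnSignal {
  escalate: boolean;
  urgency: Urgency;
}

// Hypothetical action type: what the call loop does after a patient turn.
type Step =
  | { kind: "continue_script" }
  | { kind: "handoff_now"; say: string }
  | { kind: "queue_handoff" };

const CALM_LINE =
  "Ik laat een verpleegkundige u zo terugbellen, blijft u alstublieft aan de lijn als dat lukt";

function nextStep(turn: TurnSignal): Step {
  // Urgency "now" always wins: stop the script and hand off immediately.
  if (turn.urgency === "now") {
    return { kind: "handoff_now", say: CALM_LINE };
  }
  // Assumption: other escalations are queued and the script carries on.
  if (turn.escalate) {
    return { kind: "queue_handoff" };
  }
  return { kind: "continue_script" };
}
```

Keeping the rule free of I/O makes it easy to replay recorded turn sequences against it in tests before any call goes out.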

The Nedap Ons handoff

This is where most voice-agent projects in Dutch healthcare fall over. The agent can be perfect. If the result lives in a separate dashboard the nurse has to log into, the workflow is dead on arrival.

Nedap Ons exposes an API and a webhook surface for partners. The shape is: client identified by a BSN-derived hash, a structured note posted to the client's dossier with a typed category, and a task assigned to the on-call nurse with a deep link back to the transcript. The handoff payload looks roughly like this:

{
  "client_ref": "ons:client/8f3a...",
  "event_type": "medication_check.escalation",
  "urgency": "minutes",
  "summary": "Cliënt weet niet zeker of insuline is ingenomen. Pauze van 9 sec, antwoord tegenstrijdig.",
  "transcript_url": "https://agent.zorg.example/calls/2026-06-09T07-15-12Z",
  "intent": "uncertain",
  "evidence": "Ik denk dat ik 'm al heb gehad",
  "assigned_to": {
    "kind": "on_call_role",
    "role": "wijkverpleegkundige_oost",
    "fallback_after_minutes": 4
  },
  "agent_version": "voice-2026.06.02"
}

Three details earn their keep here.

The summary is in Dutch and reads like a nurse wrote it. We tested English summaries and they were ignored. The on-call nurse will see this on a 6.1-inch screen between visits. It has to land in two seconds.

The fallback timer is not a nice-to-have. If the assigned nurse has not opened the task in four minutes, it escalates to the team lead. We chose four minutes after watching the actual response-time distribution in the pilot. Yours will be different.

The agent_version sits in the payload because the IGJ audit will eventually ask which version of the agent made a given decision. Bake it in from day one.
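The fallback timer reduces to one comparison. The sketch below is a minimal illustration, assuming the task records when it was created and when (if ever) the nurse opened it; `currentOwner`, the `team_lead` role name, and the timestamps are hypothetical and are not part of the Nedap Ons API.

```typescript
interface Assignment {
  role: string;                   // e.g. "wijkverpleegkundige_oost"
  fallback_after_minutes: number; // 4 in the payload above
}

// Hypothetical helper: which role should own the task right now.
// Simplification: once the nurse has opened the task it stays with her.
function currentOwner(
  assignment: Assignment,
  createdAt: Date,
  openedAt: Date | null,
  now: Date,
  fallbackRole: string = "team_lead", // assumed role name
): string {
  if (openedAt !== null) {
    return assignment.role;
  }
  const elapsedMs = now.getTime() - createdAt.getTime();
  const limitMs = assignment.fallback_after_minutes * 60 * 1000;
  return elapsedMs >= limitMs ? fallbackRole : assignment.role;
}
```

With `fallback_after_minutes: 4`, an unopened task created at 07:15 still belongs to the on-call role at 07:18 and moves to the fallback role at 07:19.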

Voice stack choices

The voice stack itself is the boring part. A SIP trunk into a telephony layer that streams audio bidirectionally, a Dutch-tuned ASR, a model in the middle for intent classification and response generation, and a Dutch TTS that does not sound like a satnav from 2012.

A few things we learned the hard way.

Dutch ASR on elderly voices, on a 4G mobile call, with a budgie in the background, is not the same problem as the demo on the vendor's marketing page. We disabled barge-in by default (let the client finish), pushed the silence-detection threshold to 1.8 seconds, and built a small custom vocabulary for medication brand names (Metformine, Sintrom, Furosemide, and friends). The vocabulary alone cut misrecognitions sharply in our internal eval set. We will not publish the percentage because the eval set is small and not yours.
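As a rough sketch, those three settings fit in one profile object. Every field and function name below is hypothetical; map them onto whatever your telephony and ASR layer actually exposes. Only the values (barge-in off, 1.8 seconds, the three medication names) come from the text above.

```typescript
// Hypothetical turn-taking profile; not any vendor's real config schema.
interface TurnTakingProfile {
  bargeIn: boolean;                 // may the agent's speech be interrupted?
  endOfSpeechSilenceMs: number;     // silence needed before the turn counts as finished
  customVocabulary: readonly string[];
}

const elderlyCallProfile: TurnTakingProfile = {
  bargeIn: false,                   // let the client finish
  endOfSpeechSilenceMs: 1800,       // 1.8 seconds
  customVocabulary: ["Metformine", "Sintrom", "Furosemide"],
};

// True once the trailing silence is long enough to treat the turn as over.
function clientHasFinished(
  trailingSilenceMs: number,
  profile: TurnTakingProfile,
): boolean {
  return trailingSilenceMs >= profile.endOfSpeechSilenceMs;
}
```

A 1.2-second pause does not end the turn under this profile, which matters for clients who stop to think mid-sentence.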

Latency matters more than model quality past a threshold. A 400ms response feels like a conversation. An 1100ms response feels like the line is dropping. The client hangs up. We picked a smaller, faster model for turn-by-turn intent classification, and a larger one for the offline post-call summary that lands in Ons.

Numbers and times are the failure mode nobody warns you about. "Half acht" (7:30) parses fine. "Tegen achten" (around eight) does not. Build the test set on real recordings before you trust any number the model extracts.
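To make the two examples concrete, here is a toy parser that covers only "half X" and "tegen X-en". It is a sketch for building test cases, not the extraction logic used in the project; `parseDutchTime` and `ParsedTime` are hypothetical names, and hours are on a 12-hour clock with no morning/evening disambiguation.

```typescript
const HOURS: Record<string, number> = {
  een: 1, twee: 2, drie: 3, vier: 4, vijf: 5, zes: 6,
  zeven: 7, acht: 8, negen: 9, tien: 10, elf: 11, twaalf: 12,
};

// Inflected forms used after "tegen" ("tegen achten" = around eight).
const HOURS_INFLECTED: Record<string, number> = {
  enen: 1, "tweeën": 2, "drieën": 3, vieren: 4, vijven: 5, zessen: 6,
  zevenen: 7, achten: 8, negenen: 9, tienen: 10, elven: 11, twaalven: 12,
};

interface ParsedTime {
  hour: number;         // 1..12, 12-hour clock
  minute: number;
  approximate: boolean; // true for "tegen ..." style phrases
}

function parseDutchTime(phrase: string): ParsedTime | null {
  const words = phrase.trim().toLowerCase().split(/\s+/);
  if (words.length !== 2) return null;
  const first = words[0] ?? "";
  const second = words[1] ?? "";

  if (first === "half") {
    // "half acht" means half an hour BEFORE eight, i.e. 7:30.
    const h = HOURS[second];
    if (h === undefined) return null;
    return { hour: h === 1 ? 12 : h - 1, minute: 30, approximate: false };
  }
  if (first === "tegen") {
    const h = HOURS_INFLECTED[second];
    if (h === undefined) return null;
    return { hour: h, minute: 0, approximate: true };
  }
  return null; // anything else is out of scope for this sketch
}
```

Carrying an `approximate` flag through to the summary lets the nurse see that "around eight" was never an exact time, rather than having it silently rounded.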

The red-flag taxonomy

The red flags are not arbitrary. The clinical team owns this list. We facilitated, they decided. After three workshops the production taxonomy looked like this:

Immediate (urgency: now): chest pain, shortness of breath, confusion, mention of a fall in the last hour, slurred speech, "ik voel me niet goed" with no further detail.
Within minutes (urgency: minutes): uncertainty about an anticoagulant or insulin dose, refused dose without explanation, mention of nausea or dizziness, mention of a fall older than an hour.
Within the visit (urgency: none, but flagged): tolerable side effects, mood-related signals, social isolation cues.

The model's job is to classify into one of these. The nurse's job is to decide what to do about it. We never let the agent suggest the action.
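One way to keep the taxonomy auditable is a plain lookup table that the clinical team can review line by line. The sketch below is illustrative: the flag identifiers are hypothetical, and treating an unknown flag as "minutes" is an assumption (chosen so that a typo never silently downgrades a call), not something the taxonomy above specifies.

```typescript
type Urgency = "none" | "minutes" | "now";

// Hypothetical flag identifiers mirroring the three tiers above.
const RED_FLAG_URGENCY: Record<string, Urgency> = {
  chest_pain: "now",
  shortness_of_breath: "now",
  confusion: "now",
  fall_within_last_hour: "now",
  slurred_speech: "now",
  unwell_no_detail: "now",
  uncertain_anticoagulant_or_insulin: "minutes",
  refused_without_explanation: "minutes",
  nausea: "minutes",
  dizziness: "minutes",
  fall_older_than_hour: "minutes",
  tolerable_side_effect: "none",
  mood_signal: "none",
  social_isolation_cue: "none",
};

const RANK: Record<Urgency, number> = { none: 0, minutes: 1, now: 2 };

// Returns the most urgent tier among the flags raised in a call.
function highestUrgency(flags: string[]): Urgency {
  let worst: Urgency = "none";
  for (const flag of flags) {
    // Assumption: unknown flags are treated as "minutes", never ignored.
    const urgency: Urgency = RED_FLAG_URGENCY[flag] ?? "minutes";
    if (RANK[urgency] > RANK[worst]) worst = urgency;
  }
  return worst;
}
```

The function only ranks; it deliberately returns no suggested action, in line with the split between the model's job and the nurse's.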

Audit trail for IGJ

Every call is logged with the audio recording, the streaming transcript, the turn-by-turn intent classifications with confidences, the final summary, the handoff payload if any, and the model versions involved. Storage sits in EU-region buckets with a 7-year retention to match the medical record. Access is gated by the same SSO the provider already uses for Ons.
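A sketch of what one log entry might look like as a type, plus the retention arithmetic. The field names and `retentionExpiry` are hypothetical; only the list of logged items and the 7-year figure come from the paragraph above.

```typescript
// Hypothetical shape of one audit entry; field names are illustrative.
interface CallAuditRecord {
  callId: string;
  recordedAt: string;                 // ISO 8601, UTC
  audioUri: string;                   // recording in an EU-region bucket
  transcriptUri: string;
  turns: { intent: string; confidence: number }[];
  summary: string;
  handoffPayload: unknown | null;     // null when the call closed clean
  modelVersions: string[];
}

// Date after which the record may be deleted, assuming a fixed retention in years.
// Note: a 29 February start date rolls over to 1 March in non-leap years.
function retentionExpiry(recordedAt: Date, years: number = 7): Date {
  const expiry = new Date(recordedAt.getTime());
  expiry.setUTCFullYear(expiry.getUTCFullYear() + years);
  return expiry;
}
```

For a call recorded on 25 March 2025, `retentionExpiry` yields 25 March 2032.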

We expose a read-only auditor view that lets a quality officer pull any call, see the full chain, and export a PDF. We have not yet had an IGJ visit on this client, but we built the export against the digital-health inspection guidance published openly by IGJ.

Failure modes we planned for

Healthcare AI is not the place to learn that your agent can hallucinate. We knew it would. The question was: what do we let it do when it does?

Three guardrails matter most.

The agent has a hard refusal list. Any clinical instruction, any dose advice, any "yes I think you should skip it" gets caught by a post-generation classifier and replaced with the calm escalation line. We log the refusal so we can tune the prompt.
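The shape of that guard is simple even if the classifier behind it is not. In the sketch below the classifier is passed in as a function, because its real implementation is not described here; `guardReply`, `AdviceClassifier`, and the log format are all hypothetical.

```typescript
const ESCALATION_LINE =
  "Ik laat een verpleegkundige u zo terugbellen, blijft u alstublieft aan de lijn als dat lukt";

// Stand-in signature for the post-generation classifier (an assumption).
type AdviceClassifier = (draft: string) => boolean;

interface GuardResult {
  text: string;     // what the agent is actually allowed to say
  refused: boolean; // true when the draft was replaced
}

// Runs after generation and before TTS: anything that looks like clinical
// advice is logged and replaced with the calm escalation line.
function guardReply(
  draft: string,
  looksLikeAdvice: AdviceClassifier,
  log: (entry: string) => void,
): GuardResult {
  if (looksLikeAdvice(draft)) {
    log(`refusal: ${draft}`);
    return { text: ESCALATION_LINE, refused: true };
  }
  return { text: draft, refused: false };
}
```

Because the guard sits between generation and speech, a missed refusal in the prompt still cannot reach the client unless the classifier also misses it.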

The agent has a max-turn budget. Six turns and the call ends, escalation or not. This protects against loops where a confused client and a confused model spiral together.

The agent has no tool access beyond logging and handoff. It cannot call the pharmacy. It cannot reschedule a visit. It cannot send an SMS. There has been a steady run of front-page stories this month about agents running amok inside developer environments, and a parallel debate about how much retention is reasonable for the providers behind them. None of that changes a healthcare brief. You assume the model will misfire. You build a system where the worst it can do is log a transcript and ring a nurse.
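The turn budget and the tool restriction are both small enough to enforce in code rather than in the prompt. A minimal sketch, with hypothetical tool names and helper functions:

```typescript
const MAX_TURNS = 6;

// Hypothetical tool names: the only two capabilities the agent is given.
const ALLOWED_TOOLS = ["log_transcript", "create_handoff"] as const;
type AllowedTool = (typeof ALLOWED_TOOLS)[number];

// Anything not on the allowlist is rejected before it reaches a dispatcher.
function isAllowedTool(name: string): name is AllowedTool {
  return (ALLOWED_TOOLS as readonly string[]).indexOf(name) !== -1;
}

// True once the budget is spent, regardless of how the call is going.
function mustEndCall(turnsTaken: number): boolean {
  return turnsTaken >= MAX_TURNS;
}
```

Enforcing both limits outside the model means a misfiring prompt cannot talk its way past them.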

Takeaway

A healthcare voice agent earns its keep at the handoff, not at the conversation. Optimize the second the human takes over, not the ninety seconds before.

What changed after week two

Two weeks into the pilot we cut three questions from the questionnaire. They were technically useful, clinically interesting, and made the call thirty seconds longer. Thirty seconds was the difference between an 85% completion rate and a 71% completion rate. We kept the four questions the on-call nurse actually used, and dropped the rest into the post-visit form a nurse fills in anyway.

We also changed the opening line. The first version started with, "Goedemorgen, ik bel namens [provider] voor uw medicatiecontrole." Clients hung up. They thought it was a survey. The second version said, "Goedemorgen mevrouw De Vries, ik bel even kort om te controleren of de ochtendmedicatie goed is gegaan." Clients stayed on. Specifics beat scripts, even in a robot voice.

The smallest thing you can do today

If you run an operation with recurring outbound calls under any duty-of-care frame, sit with the person who answers the escalations for one shift. Watch which app they keep open. That app is where your handoff has to land. Not your dashboard. Theirs.

When we built this voice agent for the thuiszorg provider, the thing we kept circling back to was that the model was the easy part. The Nedap Ons handoff and the Wkkgz audit trail were what made the project shippable. That is usually how it goes.

FAQ

Can a voice agent legally give medication advice under Wkkgz?

No. Wkkgz duty-of-care sits with the named caregiver. A voice agent can ask, listen, log, and escalate. Anything that reads as clinical advice has to be caught and routed to a human.

Why route into Nedap Ons instead of a separate dashboard?

Nurses already live in Ons during a shift. A parallel dashboard means one more app to open and one more thing to forget. Inside-Ons handoffs land in the workflow the nurse is already in.

How fast does the escalation handoff need to be?

For urgency 'now' you want a named nurse seeing the card inside the working minute. We use a four-minute fallback to a team lead, tuned to the real pilot response-time distribution.

What happens if the client does not pick up?

The agent makes two retry attempts, spaced fifteen minutes apart. After the third failed attempt (the original call plus both retries) the call is flagged in Ons as 'no contact' so the on-call nurse can decide whether to visit in person.
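As a sketch, the retry rule fits in one function. `afterFailedAttempt` and `RetryDecision` are hypothetical names, and the assumption that each retry is scheduled fifteen minutes after the previous failed attempt is ours.

```typescript
const RETRY_GAP_MINUTES = 15;
const MAX_ATTEMPTS = 3; // the original call plus two retries

type RetryDecision =
  | { kind: "retry"; at: Date }
  | { kind: "no_contact" }; // flag in Ons for the on-call nurse

// attemptNumber is 1 for the original call, 2 and 3 for the retries.
function afterFailedAttempt(attemptNumber: number, failedAt: Date): RetryDecision {
  if (attemptNumber >= MAX_ATTEMPTS) {
    return { kind: "no_contact" };
  }
  const retryAt = new Date(failedAt.getTime() + RETRY_GAP_MINUTES * 60 * 1000);
  return { kind: "retry", at: retryAt };
}
```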

Is a recording of the call required for the audit trail?

Yes, alongside the transcript, the turn-by-turn classifications, the final summary, and the model versions. Store it EU-region with 7-year retention to match the medical record.
