Voice agents for trades: a Tilburg roofing case study

A Tilburg roofer's office manager used to spend Friday afternoons typing paper job cards into Exact Online. Now her crews talk and the lines appear by morning.

Jacob Molkenboer · Founder · A Brand New Company · 7 Jun 2026 · 9 min

It is 17:42 on a Friday in March. Karin, the office manager at a 34-person roofing contractor near the Wilhelminakanaal in Tilburg, has a stack of forty-one paper job cards in front of her. Each one is a damp A5 sheet with a scrawled address, a line about EPDM 1.5 mm, 38 m², a couple of guesses about hours, and a signature from a homeowner who wanted the crew off the roof before the rain came back. She has until Monday morning to turn that stack into invoice lines in Exact Online, otherwise the monthly billing run slips and the bookkeeper starts sending pointed emails.

This is the company that hired us in February. They do flat-roof renovations, mostly insurance-driven, mostly within a forty-kilometre radius. Revenue around 6.2 million euro. Fourteen vans, six two-person crews, four estimators, a foreman, Karin, and a director who has been threatening to buy a field-service SaaS for three years and never pulled the trigger because the quotes all started at 18k setup and locked the crews into a tablet workflow they would never actually use.

What we built instead is a voice agent. The crews talk into their phones for ninety seconds at the end of each job. By the time Karin opens her laptop the next morning, the lines are in Exact Online, draft status, ready for her to glance at and release. Below is how it works, what broke, and what we would do differently.

Why paper job cards survived this long

Before we get to the agent, it is worth being honest about why the paper system was sticky. The crews hated tablets. The previous attempt, a Dutch field-service product whose name we will not drag through the mud, had a six-screen flow per job: select customer, select project, pick materials from a dropdown of 1,400 SKUs, enter quantities, enter hours per person, sign, sync. On a windy roof, with gloves on, in February, this is a non-starter. The foreman's exact phrase was "ik ben dakdekker, geen datatypist" ("I am a roofer, not a data typist"). The tablets ended up in the glovebox.

Paper worked because it had one interaction model: write something, anything, and hand it to Karin. The cost was that Karin became the integration layer. She was reading handwriting, looking up SKUs, fixing crew arithmetic, and reconciling against the original quote. Three to four hours per Friday, plus the Monday morning panic when a card went missing.

The shape of the voice agent

We did not build a chatbot. We built something closer to a one-button dictaphone with a brain attached. The crew opens a PWA we shipped to their phones, taps a big orange circle, and talks. There is no form. There is no dropdown. The prompt the crew hears in their earpiece is the same every time:

"Welke klus, welk adres, wat heb je gedaan, hoeveel materiaal, hoeveel uur per man. Praat gewoon." ("Which job, which address, what did you do, how much material, how many hours per person. Just talk.")
Onboarding script we recorded with the foreman, 2026

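A typical recording is somewhere between sixty and a hundred and twenty seconds. It sounds like this, lightly anonymised:

"Ja, dit is Mo, klus Van Goghlaan 14 in Goirle,
we hebben de hele achterdakkapel gedaan,
ongeveer tweeënveertig vierkante meter EPDM anderhalve mil,
drie rollen, twee tubes Bostik, één afvoer vervangen,
ik en Driss vier uur, schuif maar onder de offerte van
vorige week, die met die loodslab."

In English: "Yes, this is Mo, job at Van Goghlaan 14 in Goirle, we did the whole rear dormer, about forty-two square metres of EPDM one and a half mil, three rolls, two tubes of Bostik, one drain replaced, me and Driss four hours, just put it under last week's quote, the one with the lead flashing."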

That recording goes through four stages: transcription, extraction, matching, and posting. Each one is its own model call, because trying to do all of it in a single prompt produced confident hallucinations about SKUs that did not exist.

Stage 1: transcription

We use a Dutch-tuned Whisper variant running on a Hetzner GPU box. Cloud STT was tempting, but the crews mumble, the wind hits the mic, and the regional accent around Tilburg eats the end of words. A model fine-tuned on our own corpus of ninety job recordings outperformed every off-the-shelf API we tested. The Whisper paper and code are the obvious starting point if you want to do this yourself.

Stage 2: extraction

The transcript goes to an LLM with a strict JSON schema. We do not let the model invent fields. The schema is the entire contract:

{
  "job_reference": "string | null",
  "address_hint": "string | null",
  "work_summary": "string",
  "materials": [
    { "description": "string", "quantity": "number", "unit": "string" }
  ],
  "labour": [
    { "worker_hint": "string", "hours": "number" }
  ],
  "quote_link_hint": "string | null",
  "confidence": "low | medium | high"
}

The confidence field is the most important one. Anything below "high" gets flagged for Karin to glance at before release. We tuned the prompt for two weeks until the model started honestly saying "low" when the crew member talked over a chainsaw, instead of guessing.
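A minimal sketch of how "the schema is the entire contract" could be enforced on the model's reply before anything moves downstream. It is not the production code: the function names parse_extraction and needs_review are hypothetical, and only the field names come from the schema above.

```python
import json

# Field names copied from the schema in the article.
SCHEMA_KEYS = {
    "job_reference", "address_hint", "work_summary", "materials",
    "labour", "quote_link_hint", "confidence",
}
MATERIAL_KEYS = {"description", "quantity", "unit"}
LABOUR_KEYS = {"worker_hint", "hours"}
NULLABLE_STRINGS = ("job_reference", "address_hint", "quote_link_hint")
CONFIDENCE_LEVELS = {"low", "medium", "high"}


def parse_extraction(raw: str) -> dict:
    """Parse the model reply and reject anything that strays from the schema."""
    data = json.loads(raw)
    if not isinstance(data, dict) or set(data) != SCHEMA_KEYS:
        raise ValueError("reply has missing or invented top-level fields")
    for key in NULLABLE_STRINGS:
        if data[key] is not None and not isinstance(data[key], str):
            raise ValueError(f"{key} must be a string or null")
    if not isinstance(data["work_summary"], str):
        raise ValueError("work_summary must be a string")
    if data["confidence"] not in CONFIDENCE_LEVELS:
        raise ValueError("confidence must be low, medium or high")
    for field, keys in (("materials", MATERIAL_KEYS), ("labour", LABOUR_KEYS)):
        if not isinstance(data[field], list):
            raise ValueError(f"{field} must be a list")
        for item in data[field]:
            if not isinstance(item, dict) or set(item) != keys:
                raise ValueError(f"unexpected shape in {field}")
    return data


def needs_review(data: dict) -> bool:
    """Anything below 'high' is flagged for a human glance before release."""
    return data["confidence"] != "high"
```

Rejecting unknown keys outright, rather than silently dropping them, is one way to make an invented field visible as a failure instead of a quiet data loss.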

Stage 3: matching

This is the stage everyone underestimates. "Three rolls of EPDM 1.5 mm" is not an Exact Online line item. It needs to become SKU EPDM-150-RL20, quantity 3, unit rol, price from the customer's current contract. We built a small matching service that holds the SKU catalogue in Postgres with a pg_trgm index on description plus a vector embedding of the marketing name. The crew's words get embedded, top-k candidates come back, and a second LLM call picks the right one with the price sheet in context.
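To make the candidate-retrieval step concrete, here is a small pure-Python stand-in for the trigram half of that lookup. It approximates the similarity that pg_trgm computes (shared trigrams divided by all distinct trigrams) but it is not Postgres and not the production service. The SKU EPDM-150-RL20 comes from the text above; every other catalogue entry and every description is hypothetical.

```python
import re


def trigrams(text: str) -> set:
    """Lowercase, split into alphanumeric words, pad each word, cut into 3-grams."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "  # two leading spaces, one trailing, as pg_trgm pads
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams over all distinct trigrams, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# Hypothetical catalogue: only the first SKU code appears in the article.
CATALOGUE = {
    "EPDM-150-RL20": "EPDM 1.5 mm rol 20 m",
    "EPDM-120-RL20": "EPDM 1.2 mm rol 20 m",
    "HWA-080-PVC": "hemelwaterafvoer PVC 80 mm",
}


def top_candidates(spoken: str, k: int = 2) -> list:
    """Return the k best (sku, score) pairs for a spoken material description."""
    scored = sorted(
        ((similarity(spoken, desc), sku) for sku, desc in CATALOGUE.items()),
        reverse=True,
    )
    return [(sku, round(score, 2)) for score, sku in scored[:k]]
```

The sketch assumes the spoken phrase has already been normalised ("anderhalve mil" to "1.5 mm"). The shortlist it returns is what the second LLM call would choose from, with the price sheet in context.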

Address hints get resolved against the open jobs list. "Van Goghlaan 14 in Goirle" matches a project that already has a quote attached, so the new lines slot under that project. If nothing matches above 0.82 cosine similarity, the job goes to a "needs human" queue. That queue averages two jobs a day.
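The threshold rule can be sketched in a few lines, assuming the address hint and each open job are already represented as embedding vectors. The 0.82 cut-off is from the text. The function names, the job identifiers, and the toy three-dimensional vectors in the usage note are hypothetical.

```python
import math

MATCH_THRESHOLD = 0.82  # cut-off quoted in the article


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def resolve_job(hint_vec, open_jobs: dict):
    """Return (job_id, score) for the best match above the threshold.

    Returns (None, best_score) when nothing clears the bar, which the
    caller would treat as "send to the needs human queue".
    """
    best_id, best_score = None, -1.0
    for job_id, vec in open_jobs.items():
        score = cosine(hint_vec, vec)
        if score > best_score:
            best_id, best_score = job_id, score
    if best_score > MATCH_THRESHOLD:
        return best_id, best_score
    return None, best_score
```

With open_jobs = {"P-1": [0.9, 0.1, 0.0], "P-2": [0.0, 1.0, 0.0]}, the hint [1.0, 0.0, 0.0] resolves to "P-1", while the ambiguous hint [1.0, 1.0, 0.0] clears the bar for neither job and comes back as None.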

Stage 4: posting to Exact Online

Exact Online's REST API is fine once you accept that the OAuth token expires every ten minutes and the rate limit will bite you the moment you try to backfill anything. We post draft sales invoice lines via the Exact Online REST resources and never auto-finalise. Karin sees a list of drafts every morning, ordered by confidence, with the original audio clip embedded as a play button. One click to release, one click to send back for clarification.
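One way to live with a ten-minute token is a small cache that refreshes shortly before expiry, so no request goes out with a token about to die. The sketch below is generic and hypothetical: TokenCache is an illustrative name, and the refresh callable stands in for whatever performs the real OAuth refresh request, which is deliberately not shown.

```python
import time
from typing import Callable, Optional, Tuple


class TokenCache:
    """Hold an access token and renew it a safety margin before it expires."""

    def __init__(
        self,
        refresh: Callable[[], Tuple[str, float]],  # returns (token, lifetime in seconds)
        margin_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._refresh = refresh
        self._margin = margin_s
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return a token that is valid for at least margin_s more seconds."""
        if self._token is None or self._clock() >= self._expires_at - self._margin:
            token, lifetime = self._refresh()
            self._token = token
            self._expires_at = self._clock() + lifetime
        return self._token
```

Injecting the clock keeps the class testable without sleeping. With a 600-second lifetime and a 60-second margin, a call at 500 seconds reuses the token and a call at 545 seconds triggers a refresh.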

Warning

Do not auto-finalise invoice drafts. Even at 98% extraction accuracy, the 2% that go wrong are the ones a customer will remember for two years. Keep a human on the release button until you have six months of clean data.

What broke in week one

The first week of production was instructive. Three things broke that we did not anticipate.

First, the crews recorded jobs in the van, on the way to the next address, with the radio on. Sky Radio at 80 km/h is a surprisingly hostile acoustic environment. We added a thirty-second pre-roll silence check and a gentle nudge in the PWA: "Te veel achtergrondgeluid. Even pauzeren?" ("Too much background noise. Pause for a moment?"). Recording quality went up overnight.
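A background-noise check of this kind can be as simple as measuring the loudness of the pre-roll before the crew member starts talking. The sketch below is an assumption about how such a check might work, not the PWA's actual code: the -35 dBFS limit is a hypothetical value, and samples are assumed to be floats between -1.0 and 1.0.

```python
import math

NOISE_LIMIT_DBFS = -35.0  # hypothetical limit; tune against real van recordings


def rms_dbfs(samples) -> float:
    """Root-mean-square level of the samples in dB relative to full scale."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")


def too_noisy(pre_roll) -> bool:
    """True when the pre-roll is louder than the limit, so the app should nudge."""
    return rms_dbfs(pre_roll) > NOISE_LIMIT_DBFS
```

A near-silent pre-roll with amplitude around 0.001 sits near -60 dBFS and passes. A radio at amplitude 0.1 sits near -20 dBFS and would trigger the nudge.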

Second, one crew member, an older roofer with thirty years on the job, simply refused. He would not record. The foreman's solution was elegant: he paired him with a younger colleague who recorded on his behalf, narrating what the older roofer had done. The lesson is that adoption is a social problem, not a technical one, and we should have spent more time on the kickoff conversation.

Third, the model started inventing materials. A crew member said "en wat tape" ("and some tape") and the model confidently produced "Bostik flexibele afdichtingsband, 1 rol, 25m" (Bostik flexible sealing tape, 1 roll, 25 m). We had not given it a fallback for vague terms. The fix was a rule: if the quantity or unit is missing from the audio, the line goes to the "needs human" queue regardless of confidence score. The hallucination rate on materials dropped to roughly one line per two hundred.
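The rule is easy to state in code. This sketch assumes the extraction step emits null for a quantity or unit it did not hear, rather than a guess. That is an assumption, since the schema shown earlier does not mark those fields as nullable. The function name and the three queue labels are hypothetical.

```python
def route_line(line: dict, confidence: str) -> str:
    """Decide where one material line goes.

    A missing quantity or unit overrides the confidence score entirely.
    """
    quantity = line.get("quantity")
    unit = line.get("unit")
    has_quantity = (
        isinstance(quantity, (int, float))
        and not isinstance(quantity, bool)
        and quantity > 0
    )
    has_unit = isinstance(unit, str) and unit.strip() != ""
    if not (has_quantity and has_unit):
        return "needs_human"
    # Complete lines still follow the confidence rule from the extraction stage.
    return "draft" if confidence == "high" else "review"
```

For "en wat tape" the extraction would yield a description with no quantity and no unit, so the line routes to the queue even if the model rated the whole recording as high confidence.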

The numbers after ten weeks

We have ten weeks of production data. Karin's Friday afternoon went from three to four hours of data entry to about thirty-five minutes of review and release. The May billing run closed on the Wednesday of the following week instead of the Friday. Cash collection moved forward by roughly nine days on average, which on a 6.2 million euro book is real money even at a conservative cost of capital.
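A back-of-the-envelope version of that cash claim, as a sketch under stated assumptions: revenue is invoiced evenly across the year, and the 5% cost of capital is a hypothetical rate chosen only for illustration. The revenue figure and the nine days are from the text.

```python
ANNUAL_REVENUE_EUR = 6_200_000  # from the article
DAYS_EARLIER = 9                # from the article
COST_OF_CAPITAL = 0.05          # hypothetical annual rate, for illustration only


def cash_freed(annual_revenue: float, days_earlier: float) -> float:
    """Working capital released when invoices are collected days_earlier sooner."""
    return annual_revenue / 365 * days_earlier


freed = cash_freed(ANNUAL_REVENUE_EUR, DAYS_EARLIER)
saving = freed * COST_OF_CAPITAL
print(f"{freed:,.0f} euro freed, {saving:,.0f} euro per year")
# prints "152,877 euro freed, 7,644 euro per year"
```

On those assumptions, nine days is roughly 153,000 euro that sits in the contractor's account instead of the customers', worth several thousand euro a year before counting Karin's recovered hours.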

The crews record about ninety-two percent of jobs. The remaining eight percent are mostly very short visits where the crew skips the recording and Karin pulls from the original quote. We are fine with that. The goal was never one hundred percent coverage, it was killing the Friday backlog.

Takeaway

The win was not the AI. It was deleting the form. The voice agent worked because the crews already knew how to talk, and we stopped asking them to do anything else.

Where this does not work

A few caveats before anyone tries to copy this wholesale. This setup works because the company has a tight SKU catalogue, a single ERP, and an office manager who already knew the business cold. If you have three operating companies, a sprawling product list, and nobody who can spot a wrong line at a glance, you will not get the same result. The voice agent is a compression layer between the field and the back office, and the back office still has to know what "correct" looks like.

We also want to be honest that this is not magic. There is a real ongoing cost. Whisper inference, LLM calls, embeddings, the PWA, and the Exact Online sync land somewhere around 380 euro per month at current volumes. That is comfortably under the salary cost it replaced, but it is not free, and it grows with usage.

What we would do differently

If we were starting again, we would build the "needs human" queue first, before the happy path. The queue is where trust gets earned or lost. We spent the first sprint making the extraction excellent and the second sprint making the review screen tolerable. We should have done it the other way around.

We would also skip the PWA and start with WhatsApp voice notes. The crews already use WhatsApp. The PWA gave us a cleaner audio pipeline but cost us two weeks of onboarding friction. For the next contractor we talk to, WhatsApp-in, Exact-out, is the v1.

When we built this voice agent for the Tilburg roofer, the thing we kept running into was that every off-the-shelf field-service product wanted to change how the crews worked. We ended up solving it by changing nothing about the crews, and putting the entire burden of structure on the model and the queue behind it.

If you want to try the smallest version of this today: pick one job type, ask one crew member to send you a voice note at end of day for a week, and transcribe them by hand. Read the transcripts on Friday. You will know within an hour whether the structure is extractable, and you will not have written a line of code.

FAQ

Why not just use WhatsApp voice notes from day one?

You can. We used a PWA for cleaner audio and a single-button UX, but for a v1 a WhatsApp number that pipes into Whisper and Exact Online will get you 80% of the value in a week.

How accurate is the extraction in practice?

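Crews record roughly 92% of jobs, and those recordings become draft invoice lines. About two jobs a day fail matching and land in the "needs human" queue, and anything the model rates below high confidence is flagged for review before release. Material hallucinations run at about 1 in 200 lines.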

What does this cost to run per month?

Around 380 euro at the contractor's current volume: Whisper inference on a Hetzner GPU, LLM calls for extraction and matching, embeddings, the PWA, and the Exact Online sync.

Does it auto-post invoices to customers?

No. Everything lands as a draft in Exact Online. The office manager reviews and releases. We strongly advise against auto-finalising for at least the first six months.

Does it work in Dutch with regional accents?

Yes, but only after fine-tuning. Off-the-shelf cloud STT struggled with the Brabant accent and wind noise. A Dutch-tuned Whisper variant fine-tuned on our own corpus outperformed every API we tested.
