Field-service RAG: replacing paper reports at an HVAC firm

It is 18:40 on a Tuesday in Eindhoven. The office manager has 23 paper service reports on her desk, each one a different handwriting, all due in Exact Online by morning.

Jacob Molkenboer · Founder · A Brand New Company · 8 Jun 2026 · 9 min

18:40 on a Tuesday in Eindhoven

The office manager has 23 paper service reports on her desk. Each one a different handwriting. Each one due in Exact Online by tomorrow morning, because the bookkeeper closes the week on Wednesday. She picks up the top sheet. The installer wrote Vitodens 100-W, gewisseld, plus 3x flexkoppeling 22mm, 1.5u arbeid, klant tekent voor akkoord (Vitodens 100-W replaced, plus three 22 mm flex couplings, 1.5 hours labour, customer signs to approve). She opens the project in Exact, types the materials, picks the hour code, attaches the scan of the form. Forty-three keystrokes per job. She has done this every weekday evening for eleven years.

This is the problem we were asked to fix in February. The client is a 41-person HVAC installation firm in the Eindhoven region. Twelve service vans, three apprentices, one office. Their tools are excellent: branded vans, calibrated leak detectors, certified installers, a Volvo dealer plate over the workshop door. Their data flow ends at a clipboard.

Why the paper survived for so long

Every previous attempt to digitise their service reports had failed for the same reason. The forms are not forms. They are notes. An installer writes ketelaanvoer 62 graden, retour 41, modulatie ok, klant klaagde over tikken in radiator 1e verdieping, niet gevonden (boiler flow 62 degrees, return 41, modulation OK, customer complained about ticking in the first-floor radiator, cause not found). That sentence has to become: one labour line, possibly one materials line if anything was swapped, a project memo for the next service round, and a follow-up flag for the planner. No off-the-shelf field-service app does that translation. They give you a form with checkboxes and the installers ignore it, because the customer is standing next to them and wants to talk about the noise in the radiator.

So the firm kept the paper. The cost was paid by the office manager, after hours. Two previous vendors had pitched them an iPad app and a tablet-mounted form builder. Both pilots collapsed inside a month. The installers refused to type. They wrote on paper and then re-entered the form on the tablet at 18:00 in the depot. The tablet added a step; it did not remove one.

The shape of the rebuild

We did not build an app. We built a small assistant that lives on the installer's phone, takes a photo plus a 20-second voice note at the end of each job, and produces a draft Exact Online entry the office manager approves on her laptop the next morning. The voice note is the load-bearing piece. Installers will not type on a phone with cold hands at 17:00 in November. They will talk to it.

The flow looks like this:

phone (PWA)
  ├─ photo of the paper sheet (still required for warranty)
  ├─ 20s voice note in Dutch, often Brabants accent
  └─ structured fields: project number, end time, signature scan
            │
            ▼
edge function (Supabase)
  ├─ Whisper transcription (nl) with custom vocab
  ├─ RAG over: SKU catalog, hour codes,
  │             customer contract terms, brand glossary
  └─ structured draft → Exact Online (staging table)
            │
            ▼
office laptop (web UI)
  ├─ side-by-side: paper scan | draft entry
  ├─ one-click approve → push to Exact
  └─ rejected entries feed a correction log

The PWA was a deliberate choice. Installers update their Android phones once every two years if at all. A native app would have meant a forced-upgrade fight every time the wholesaler's SKU catalog changed, which it does on its own schedule. The PWA refreshes silently in the background and we own the rollout cadence.

Why this is genuinely a RAG problem

An LLM alone gets you sixty percent of the way there. It will turn drie flex 22mm (three flex 22 mm) into a plausible materials line and assign it to a plausible hour code. Plausible is not good enough for accounting.

The materials catalog has 4,200 SKUs. There are six different flex 22mm SKUs depending on whether the connector is for water, gas, central heating, solar, or one of two warranty-grade variants. The wrong choice means a 12 EUR margin error per van per day. Multiplied across twelve vans and 220 working days, that is about 31,700 EUR a year, roughly the office manager's salary.

So we ground the agent on the firm's own three sources of truth: the SKU table from their wholesalers (Technische Unie and Itho Daalderop), the hour-code list mirrored from Exact, and a brand glossary we wrote by hand with the senior installer over two afternoons in the workshop. The glossary is the unsung hero. It maps Vito to Viessmann, Inti to Intergas, ketel (boiler) to the correct boiler-line family, and a long list of regional mispronunciations to the products they actually mean. The original RAG paper from Lewis et al. framed retrieval as a way to inject factual grounding into generation. In a 41-person installation firm, the facts are: which SKU, which hour code, which customer contract clause.
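The glossary pass can be pictured as a plain dictionary lookup over the transcript. The sketch below is a hypothetical stand-in, not the production code: only the Vito and Inti mappings come from the glossary described above, and the function name, the whole-word matching, and the return shape are assumptions made for illustration.

```python
import re

# Two entries taken from the article; the real glossary is much longer.
BRAND_GLOSSARY = {
    "vito": "Viessmann",
    "inti": "Intergas",
}


def expand_glossary(transcript: str, glossary: dict[str, str] = BRAND_GLOSSARY) -> list[str]:
    """Return one hint per glossary term that appears as a whole word.

    Matching is case-insensitive. Hints keep the glossary's own order so
    the prompt stays stable between runs.
    """
    words = set(re.findall(r"\w+", transcript.lower()))
    return [f"{term} -> {brand}" for term, brand in glossary.items() if term in words]


hints = expand_glossary("Vito gewisseld, Inti nagekeken")
```

The hints are passed to the prompt as extra context rather than substituted into the transcript, so the model still sees what the installer actually said.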

The grounding step itself is small. The work is in keeping the indexes honest.

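def ground_voice_note(transcript: str, project_id: str) -> dict:
    """Collect grounding context for one voice note and return a draft entry."""
    # Retrieve the eight closest SKU candidates for the spoken materials.
    sku_hits   = sku_index.search(transcript, k=8)
    # Hour codes and contract terms are looked up by project, not searched.
    hours      = hour_codes_for_project(project_id)
    # Expand installer shorthand (for example Vito -> Viessmann) into hints.
    glossary   = brand_glossary.expand(transcript)
    contract   = contract_terms_for(project_id)

    prompt = build_prompt(
        transcript=transcript,
        sku_candidates=sku_hits,
        hour_codes=hours,
        glossary_hints=glossary,
        contract_clauses=contract,
    )
    return draft_entry(prompt)  # returns Exact-shaped JSON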

The Brabants accent problem

Whisper is good at Dutch. It is less good at the soft, sing-song Brabants Dutch spoken by an installer leaning against a Vitodens with a fan running in the background. The first pilot week showed a 14% word-error rate on installer audio, against 4% on the office manager's reference recordings. That gap is the difference between an agent that drafts useful entries and one that hallucinates SKUs.

We did two things. We built a small custom-vocabulary file (about 600 terms: SKU stems, brand families, installer slang) and fed it into the transcription step. And we added a confidence floor on the transcript itself. If Whisper's log-probability for any named entity in the audio drops below a threshold, the agent does not guess. It refuses the draft and pings the installer to re-record. After the vocab file landed, the word-error rate on installer audio dropped from 14% to 6%. Good enough.
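The confidence floor reduces to a filter over per-word scores. A minimal sketch, assuming the transcription step hands back each word with a log-probability: the Word type, the function names, and the floor value of -1.0 are all hypothetical, since the production threshold is not stated here.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    logprob: float  # log-probability reported by the transcription step


# Hypothetical floor; the production value is not given in the article.
LOGPROB_FLOOR = -1.0


def entities_below_floor(words: list[Word], entity_terms: set[str],
                         floor: float = LOGPROB_FLOOR) -> list[str]:
    """Return the named entities whose confidence is under the floor.

    Only words listed in entity_terms are checked; filler words may be
    low-confidence without blocking the draft.
    """
    return [
        w.text
        for w in words
        if w.text.lower() in entity_terms and w.logprob < floor
    ]


def should_rerecord(words: list[Word], entity_terms: set[str]) -> bool:
    """Refuse the draft if any named entity falls below the floor."""
    return bool(entities_below_floor(words, entity_terms))


sample = [Word("Vitodens", -0.2), Word("gewisseld", -2.5), Word("flexkoppeling", -1.7)]
terms = {"vitodens", "flexkoppeling"}
```

Checking only the named entities is the point of the design: a mumbled verb does not matter, a mumbled SKU does.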

The Exact Online integration

Exact Online is the boring half of this story and the half that took more than half of the build. Their REST API is functional but not generous. A single project-hour entry needs the project GUID, the item GUID, the hour-type GUID, a date, a cost-centre, and (for materials) a separate sales-invoice line if the customer is on a per-call contract. Six lookups before you can write one row.

We cache aggressively. The project list refreshes every fifteen minutes. The materials catalog refreshes once a night against the wholesaler feed. Hour-types and cost-centres never change, so they live in a static table the agent reads from memory. On the agent's hot path there is exactly one Exact write per approved entry, and zero reads. That is what lets a draft turn into a posted line in 22 seconds.
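The refresh schedule above amounts to a time-to-live cache per lookup table. A minimal sketch, assuming each table has a loader function that fetches a fresh copy: the TTLCache class and its parameters are hypothetical names, and the clock is injectable only to make the behaviour easy to test.

```python
import time
from typing import Any, Callable, Optional


class TTLCache:
    """Cache one loader's result and reload it after ttl_seconds."""

    def __init__(self, loader: Callable[[], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Any = None
        self._loaded_at: Optional[float] = None

    def get(self) -> Any:
        """Return the cached value, reloading it first if it has expired."""
        now = self._clock()
        expired = self._loaded_at is None or now - self._loaded_at >= self._ttl
        if expired:
            self._value = self._loader()
            self._loaded_at = now
        return self._value


# Fake clock and loader so the fifteen-minute refresh can be seen without waiting.
fake_now = [0.0]
load_count = [0]


def load_projects() -> int:
    load_count[0] += 1
    return load_count[0]


projects = TTLCache(load_projects, ttl_seconds=15 * 60, clock=lambda: fake_now[0])
```

With this shape the project list would use a 15-minute TTL and the materials catalog a 24-hour one, while the static tables need no cache object at all.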

Warning

Exact Online's API enforces a per-division rate limit that is not visible in the UI. We hit it in week two of the pilot when the office manager batch-approved 60 entries in one minute. Since 60 writes in a minute was enough to trip it, stagger writes to well under one per second per division, and read the X-RateLimit headers on every response. A retry storm against Exact will get your OAuth token cooled for the rest of the day.
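Staggering writes and honouring the headers can be combined in one small loop. This is a sketch under stated assumptions, not our production client: the header names X-RateLimit-Minutely-Remaining and X-RateLimit-Minutely-Reset, the millisecond reset timestamp, and the 1.5-second default spacing are assumptions to verify against Exact's API documentation, and post is a hypothetical callable that performs one write and returns the response headers.

```python
import time
from typing import Callable, Iterable, Mapping


def pause_from_headers(headers: Mapping[str, str], now_ms: float,
                       remaining_key: str = "X-RateLimit-Minutely-Remaining",
                       reset_key: str = "X-RateLimit-Minutely-Reset") -> float:
    """Seconds to wait before the next write, based on rate-limit headers.

    Assumes the reset header is an epoch timestamp in milliseconds.
    Missing headers are treated as "no need to wait".
    """
    remaining = int(headers.get(remaining_key, "1"))
    if remaining > 0:
        return 0.0
    reset_ms = float(headers.get(reset_key, now_ms))
    return max(0.0, (reset_ms - now_ms) / 1000.0)


def post_staggered(entries: Iterable[dict],
                   post: Callable[[dict], Mapping[str, str]],
                   min_interval_s: float = 1.5,
                   sleep: Callable[[float], None] = time.sleep,
                   now_ms: Callable[[], float] = lambda: time.time() * 1000.0) -> int:
    """Post entries one at a time, pausing after every write.

    The pause is the larger of the fixed spacing and whatever the
    rate-limit headers ask for. Returns the number of entries posted.
    """
    count = 0
    for entry in entries:
        headers = post(entry)
        sleep(max(min_interval_s, pause_from_headers(headers, now_ms())))
        count += 1
    return count
```

A batch approval in the web UI would feed its queue through post_staggered instead of firing all writes at once, which is the failure we hit in week two.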

The 22 seconds, broken down

The headline number is precise because we instrumented every step:

Voice note upload and Whisper transcription: 4.1s median
RAG retrieval over the three indexes: 0.7s
Draft generation with the grounded context: 6.8s
Office manager review of the side-by-side view: about 9s of human time
Approve, then Exact write, then confirmation: 1.4s

Total from voice note to posted in Exact: 22 seconds median, 38 seconds p95, counting processing and review time but not the overnight wait before the office manager opens the queue. The old workflow took two minutes and forty seconds per job on a good day, and that did not include the scanning step.
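The five step medians listed above add up to the headline figure, which is a quick way to check the breakdown. Strictly, a sum of medians is not the same thing as the median of the totals, so treat this as a consistency check rather than a derivation. The dictionary keys are our own labels for the steps.

```python
# Median seconds per step, copied from the breakdown above.
STEP_MEDIANS_S = {
    "upload_and_transcription": 4.1,
    "rag_retrieval": 0.7,
    "draft_generation": 6.8,
    "human_review": 9.0,
    "approve_and_exact_write": 1.4,
}

# Round to one decimal to hide floating-point noise in the sum.
total_s = round(sum(STEP_MEDIANS_S.values()), 1)

# Time spent by the system alone, without the office manager's review.
machine_s = round(total_s - STEP_MEDIANS_S["human_review"], 1)

# Old workflow: two minutes and forty seconds per job.
old_workflow_s = 2 * 60 + 40

speedup = round(old_workflow_s / total_s, 1)
print(total_s, machine_s, speedup)
```

Of the 22 seconds, 13 are machine time and 9 are the office manager's review, and the new flow is roughly seven times faster per job than the old one.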

Where the 14 hours came from

The office manager logged her time for two weeks before and four weeks after. Before: 17.5 hours per week on service-report processing, including the scanning, the data entry, the chasing of installers for unreadable handwriting, and the Wednesday-morning reconciliation with the bookkeeper. After: 3.5 hours per week, almost all of it spent on the side-by-side review and the rejected-entries log. Net saving: 14 hours per week. She now leaves at 17:00 instead of 19:00, four days out of five.

The installers saved time too, though less. Average end-of-job admin dropped from six minutes (writing the form, getting the signature, photographing the sheet for their own records) to two minutes (photographing the paper sheet that is still required for warranty, plus the 20-second voice note). The paper is still there because the warranty terms from two of their boiler suppliers explicitly require a paper signature on a sheet of the supplier's format. We did not fight that battle. It is the wrong one to fight in year one.

The refusal logic that earned the manager's trust

The agent refuses to draft an entry under four conditions: voice note shorter than four seconds, photo unreadable below a sharpness threshold, transcript confidence below the floor on any named entity, or a SKU candidate whose top match has a similarity score under 0.72. The refusal rate stabilised at 6%, almost all of it on jobs that ran past 18:00 in the dark.

Those refusals go back to the installer's phone with a single line: opnieuw opnemen, foto te donker (record again, photo too dark). The installer fixes it from the van within 30 seconds. That single guardrail was the difference between an agent the office manager trusts and one she has to babysit. An agent that gets 94% of jobs right unattended is worth more than one that gets 98% right while the manager cannot tell which 2% are wrong without reading every entry.
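The four refusal conditions fit in one function that returns the first failed check. A minimal sketch: the four-second minimum and the 0.72 similarity cut-off come from the text above, while the sharpness and log-probability thresholds, the JobInput type, and the reason strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

MIN_VOICE_SECONDS = 4.0     # from the article
MIN_SKU_SIMILARITY = 0.72   # from the article
MIN_SHARPNESS = 100.0       # hypothetical; no value is given in the article
MIN_ENTITY_LOGPROB = -1.0   # hypothetical; no value is given in the article


@dataclass
class JobInput:
    voice_seconds: float
    photo_sharpness: float
    lowest_entity_logprob: float
    top_sku_similarity: float


def refusal_reason(job: JobInput) -> Optional[str]:
    """Return the first failed check, or None when drafting may proceed."""
    if job.voice_seconds < MIN_VOICE_SECONDS:
        return "voice note too short"
    if job.photo_sharpness < MIN_SHARPNESS:
        return "photo unreadable"
    if job.lowest_entity_logprob < MIN_ENTITY_LOGPROB:
        return "low transcript confidence"
    if job.top_sku_similarity < MIN_SKU_SIMILARITY:
        return "no confident SKU match"
    return None


good_job = JobInput(voice_seconds=18.0, photo_sharpness=240.0,
                    lowest_entity_logprob=-0.3, top_sku_similarity=0.91)
dark_job = JobInput(voice_seconds=18.0, photo_sharpness=40.0,
                    lowest_entity_logprob=-0.3, top_sku_similarity=0.91)
```

Returning a single reason keeps the message on the installer's phone to one line, which is what makes a 30-second fix from the van realistic.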

Lessons from the rebuild

Three things we would do again. First, ship as a PWA. Installers update phones on geological time. A PWA gets fixed instantly when the catalog changes. Second, write the brand glossary by hand, with the senior installer in the room, before touching the model. The glossary is the firm's institutional memory and it never appears in any catalog. Third, treat the ERP write path as the hardest part of the system. The agent can be wrong and recover. A double-posted line in Exact cannot.
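One common way to guard against a double-posted line is to make the write idempotent. The sketch below illustrates that general technique and is not a description of our production write path: the fingerprint scheme, the OnceOnlyPoster class, and the in-memory set are assumptions, and a real system would keep the posted keys in durable storage such as the staging table.

```python
import hashlib
import json
from typing import Callable


def entry_key(entry: dict) -> str:
    """Stable fingerprint of an approved entry; key order does not matter."""
    canonical = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class OnceOnlyPoster:
    """Wrap a post function so that the same entry is written at most once."""

    def __init__(self, post: Callable[[dict], None]):
        self._post = post
        self._posted: set[str] = set()

    def submit(self, entry: dict) -> bool:
        """Post the entry unless it was already posted. Returns True if posted."""
        key = entry_key(entry)
        if key in self._posted:
            return False
        self._post(entry)
        # Record the key only after the write succeeded, so a failed
        # write can be retried.
        self._posted.add(key)
        return True


written = []
poster = OnceOnlyPoster(written.append)
first = poster.submit({"project": "P-1", "hours": 1.5})
second = poster.submit({"hours": 1.5, "project": "P-1"})
```

A second click on approve, or a retry after a timeout, then becomes a no-op instead of a duplicate line the bookkeeper has to reverse.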

Two things we would do differently. The voice-note buffer was originally 60 seconds, which is too long. Installers fill the time with chatter that confuses the model. We dropped it to 20 seconds and draft quality went up. And we should have built the rejected-entries log from day one. We added it in week three and it became the single most useful piece of the system. It is the office manager's audit trail and the data we use to keep the SKU disambiguation tight.

What you could do this week

If you run a service business with paper reports and an ERP backend, you do not need a 41-person pilot to find out whether this works for you. Take ten consecutive service reports from last week. Write down, for each one, exactly which fields end up in your ERP. Count the lookups per row. If it is more than three, and if your office manager processes them after hours, the case for an agent is already made. The hard work is the glossary and the ERP integration, not the model.

When we built this RAG agent for the Eindhoven installer, the thing we kept underestimating was how much of the work happens outside the model. The retrieval index took a week. The brand glossary took two afternoons. The Exact Online write path, the rate-limit handling, and the refusal logic took six weeks. That is the shape of every field-service rebuild we have shipped so far.

Key takeaway

In field-service RAG, the model is the easy part. The brand glossary, the ERP rate limits, and the refusal logic are what earn the office manager's trust.

FAQ

Why a PWA instead of a native iOS or Android app?

Installers update phones rarely. A PWA lets us push catalog and prompt changes instantly without waiting on a forced-upgrade cycle, and removes the app-store review path entirely.

How do you stop the agent from hallucinating SKUs?

Grounding on the wholesaler catalog plus a per-entity confidence floor. If the top SKU match scores under 0.72 similarity, the agent refuses to draft and asks the installer to re-record.

Why keep the paper form at all?

Two of the boiler suppliers require a paper signature on their own form for warranty validity. Fighting that battle was not worth year-one effort. The paper is photographed and shown next to the draft at review; the agent checks the photo for sharpness but drafts from the voice note.

How long did the full build take?

Eleven weeks from kickoff to office-manager rollout. Six of those weeks were the Exact Online integration and the rate-limit and refusal logic. The model and retrieval work took under three weeks.

Does the agent work in dialect or only standard Dutch?

Both, after we built a 600-term custom vocabulary file for Whisper. Installer word-error rate on Brabants Dutch dropped from 14% to 6% after the vocab landed.
