RAG for KvK extracts: a Dutch fiscal advisory case study

A 17-person fiscal advisory near Den Bosch used to keep four part-timers busy retyping KvK extracts into Yuki. A RAG agent now does each dossier in 90 seconds.

Jacob Molkenboer · Founder · A Brand New Company · 10 Oct 2024 · 9 min

Wednesday afternoon at a fiscal advisory near Den Bosch station. The office runs on quiet open-plan desks, two espresso machines, and four part-timers who, between them, retype the contents of KvK extracts into Yuki for ninety to one hundred new dossiers a week. Each dossier takes a careful person eight to twelve minutes. Each typo costs forty.

The firm has seventeen people. Two partners, five seniors, six juniors, three back-office. The four part-timers sit under "back-office" but they cost the same as a senior, and they leave every nine months because retyping addresses for a living is not what anyone signed up for. The managing partner had tried hiring four more. None lasted past the trial.

When we sat down with them in February the question was small and sharp: can a piece of software read a scanned KvK extract and post the new relation into Yuki without a person in the loop? The answer was almost yes.

Why KvK extracts are uniquely painful

A KvK extract is a one-to-five page PDF from the Dutch Chamber of Commerce that lists the legal identity of a business: KvK number, RSIN, statutory name, trade names, visiting and postal addresses, legal form, directors, authority types (alleen, gezamenlijk, beperkt bevoegd), and on newer extracts the UBO indicator.

Three things make them a bad match for naive OCR plus a regex.

First, half of them are scanned. The advisory's clients email PDFs that have been printed on an inkjet, scanned on a Brother MFC, and re-saved through Outlook's preview. Resolution is between 150 and 240 DPI. Skew is between zero and four degrees. Tesseract gets the body text right and the table cells wrong.

Second, the layout is not a form. KvK extracts are typeset like a legal document. Director sections repeat with a variable number of attributes. A BV with two directors looks structurally different from a stichting with a five-member board. A filiaal of a Maltese parent has a different header block again. There is no PDF form layer to lean on.

Third, the data has consequences. The wrong RSIN on a Yuki administration breaks the link to the tax filing flow downstream. The wrong authority type on a director creates a compliance gap that an accountant will eventually have to explain in a partner meeting.

The four part-timers were not slow because they were bad at typing. They were slow because they had to think.

Architecture: OCR, retrieval, extraction, post

The pipeline is four stages, each with one job.

# pipeline.py: one dossier, four steps
from pathlib import Path
from app.ocr import ocr_pdf
from app.retrieval import similar_dossiers
from app.schema import KvKExtract
from app.extractor import extract_structured
from app.yuki import YukiClient

REVIEW_THRESHOLD = 0.92

def process(pdf: Path) -> dict:
    pages = ocr_pdf(pdf)                            # ~3s, Dutch Tesseract + deskew
    examples = similar_dossiers(pages.text, k=4)    # ~400ms, pgvector cosine
    extract: KvKExtract = extract_structured(
        text=pages.text,
        examples=examples,
    )                                               # ~12s, structured output
    if extract.confidence < REVIEW_THRESHOLD:
        return {
            "status": "needs_review",
            "reason": extract.flagged_fields,
            "dossier": extract.model_dump(),
        }
    yuki = YukiClient.from_env()
    contact_id = yuki.contacts.upsert(extract.to_yuki_contact())
    return {"status": "posted", "contact_id": contact_id}

Step one is OCR. We run each page through Tesseract with the Dutch language pack, deskewed by OpenCV at the page level. For low-confidence pages we fall back to a cloud vision API. Cost is roughly €0.003 per dossier in cloud-vision fees and twelve seconds of CPU.
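The fallback decision in step one can be sketched as follows. This is a minimal illustration, not the firm's code: the `PageOcr` type, the 0.80 cut-off, and the choice to average per-word confidences are all assumptions, since the article does not say how "low-confidence" is defined.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical cut-off; the article does not say where "low-confidence" starts.
FALLBACK_THRESHOLD = 0.80


@dataclass
class PageOcr:
    number: int
    word_confidences: list[float]  # per-word scores in [0, 1] from the local OCR pass

    @property
    def confidence(self) -> float:
        # A page with no recognised words gives us nothing to trust.
        return mean(self.word_confidences) if self.word_confidences else 0.0


def pages_needing_fallback(
    pages: list[PageOcr], threshold: float = FALLBACK_THRESHOLD
) -> list[int]:
    """Return the page numbers that should be re-read by the cloud vision API."""
    return [p.number for p in pages if p.confidence < threshold]


pages = [
    PageOcr(1, [0.97, 0.95, 0.99]),
    PageOcr(2, [0.55, 0.70, 0.62]),  # e.g. a skewed table page
    PageOcr(3, []),                  # nothing recognised at all
]
fallback = pages_needing_fallback(pages)
```

Only the pages in `fallback` would incur a cloud-vision fee; clean pages stay on the local path.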

Step two is retrieval. The firm has roughly 2,800 historical dossiers, each one a (raw KvK text, final Yuki contact) pair, sometimes with a human correction in between. Those pairs are embedded and stored in a pgvector table. For a new extract we retrieve the four nearest neighbours by cosine similarity. The retrieval layer is what makes this a RAG system rather than a one-shot extractor. The model gets to see how a previous, structurally weird dossier (a CV with a foreign managing partner, say) was eventually written into Yuki.
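To make the nearest-neighbour step concrete, here is a pure-Python sketch of top-k retrieval by cosine similarity. The toy three-dimensional vectors and dossier ids are invented for illustration. In production the same ranking is delegated to pgvector, whose `<=>` operator returns cosine distance.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query: list[float], corpus: dict[str, list[float]], k: int = 4) -> list[str]:
    """Return the ids of the k corpus entries most similar to the query."""
    ranked = sorted(
        corpus,
        key=lambda doc_id: cosine_similarity(query, corpus[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy "embeddings"; real ones have hundreds of dimensions.
corpus = {
    "bv-two-directors": [0.9, 0.1, 0.0],
    "cv-foreign-partner": [0.1, 0.9, 0.2],
    "stichting-stak": [0.0, 0.3, 0.9],
    "filiaal-malta": [0.2, 0.8, 0.4],
    "vof-plain": [0.8, 0.2, 0.1],
}
query = [0.1, 0.95, 0.25]  # a new, CV-like extract
top = nearest(query, corpus, k=2)
```

The structurally odd dossiers rank first for an odd query, which is the point: the model sees how a similar case was handled before.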

Step three is structured extraction. The model is forced into a Pydantic schema. The schema rejects an RSIN that does not match ^\d{9}$, a legal form that is not in the Dutch enum, or a director without a birth date. A confidence score under 0.92 sends the dossier to the human review queue.

# schema.py: Pydantic v2
from typing import Literal
from pydantic import BaseModel, Field

LegalForm = Literal[
    "BV", "NV", "VOF", "CV", "Eenmanszaak",
    "Stichting", "Vereniging", "Maatschap", "Filiaal",
]
Authority = Literal["alleen", "gezamenlijk", "beperkt", "geen"]

class Address(BaseModel):
    street: str
    house_number: str
    postal_code: str = Field(pattern=r"^\d{4} ?[A-Z]{2}$")
    city: str
    country: str = "NL"

class Director(BaseModel):
    full_name: str
    born_on: str = Field(pattern=r"^\d{4}-\d{2}-\d{2}$")
    authority: Authority
    is_ubo: bool

class KvKExtract(BaseModel):
    kvk_number: str = Field(pattern=r"^\d{8}$")
    rsin: str | None = Field(default=None, pattern=r"^\d{9}$")
    statutory_name: str
    trade_names: list[str] = []
    legal_form: LegalForm
    visiting_address: Address
    postal_address: Address | None = None
    directors: list[Director]
    confidence: float = Field(ge=0.0, le=1.0)
    flagged_fields: list[str] = []

Step four is the Yuki post. Yuki has a SOAP-era web service called the Domain API. It is not pretty, but it is stable and it takes a Contact envelope with ContactCode, ContactType, addresses, and a list of contact persons. We wrap it in a small Python client and handle the three error classes we have seen in production: contact already exists, session expired, malformed VAT number.
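One way to organise those three error classes is a small classifier that maps a fault message to a follow-up action. This is a sketch under loud assumptions: the marker substrings and the chosen actions are hypothetical, not Yuki's documented fault texts, and would need to be replaced with what the service actually returns.

```python
from enum import Enum


class Action(Enum):
    UPDATE_EXISTING = "update_existing"      # contact already exists
    RELOGIN_AND_RETRY = "relogin_and_retry"  # session expired
    SEND_TO_REVIEW = "send_to_review"        # malformed VAT number
    RAISE = "raise"                          # anything we have not seen before


# Hypothetical markers; take the real ones from observed fault responses.
FAULT_MARKERS = {
    "already exists": Action.UPDATE_EXISTING,
    "session expired": Action.RELOGIN_AND_RETRY,
    "vat number": Action.SEND_TO_REVIEW,
}


def classify_fault(message: str) -> Action:
    """Map a fault message to an action; unknown faults fail loudly."""
    lowered = message.lower()
    for marker, action in FAULT_MARKERS.items():
        if marker in lowered:
            return action
    return Action.RAISE


action = classify_fault("Session expired, please log in again")
```

Keeping the mapping in one table means a fourth error class, when it shows up in production, is a one-line change rather than another branch in the client.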

There is no model magic in any of this. Most of the engineering was on the plumbing: the OCR pipeline, the retrieval index, the queue, the Yuki client, the review UI. The model is a component, not the product.

The retrieval layer that earned its name

A reasonable reader will ask whether retrieval was needed at all. KvK extracts are a public format. Could the schema and a strong system prompt not do the job alone?

We tried. The clean-form baseline (no retrieval, just a careful system prompt) reached 88% straight-through on a holdout of 200 dossiers. The four part-timers reached 96%. The gap was the edge cases.

A Dutch CV (commanditaire vennootschap) has a beheerder and one or more commanditaire vennoten. The vennoten do not have authority over the company in the operational sense, but they do appear on the extract. A naive extraction marks them as directors. Retrieval over past dossiers surfaces three previous CVs where a senior had stripped the vennoten out and added a note in Yuki's free-text field. With those examples in the prompt, the model follows the firm's convention.

A stichting that holds shares as part of an STAK structure has the same UBO complication. Retrieval surfaces the firm's preferred mapping into Yuki.

A filiaal of a foreign parent has no RSIN. Retrieval shows the model that the firm leaves the RSIN field blank and puts the parent's tax ID in a custom Yuki field.

Each of these failed without retrieval. Each works with four examples in context.
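How the retrieved pairs reach the model can be sketched as plain prompt assembly. The `PastDossier` type, the section headings, and the example contents below are illustrative assumptions; the article does not show the actual prompt.

```python
import json
from dataclasses import dataclass


@dataclass
class PastDossier:
    raw_text: str        # OCR text of the historical extract
    final_contact: dict  # what ended up in Yuki, after any human correction


def build_prompt(new_text: str, examples: list[PastDossier]) -> str:
    """Place retrieved (extract, final contact) pairs ahead of the new extract."""
    parts = [
        "Extract the fields of the KvK extract below as JSON.",
        "Follow the conventions shown in the worked examples.",
    ]
    for i, ex in enumerate(examples, start=1):
        contact = json.dumps(ex.final_contact, ensure_ascii=False, sort_keys=True)
        parts.append(f"### Example {i}: extract\n{ex.raw_text}")
        parts.append(f"### Example {i}: final contact\n{contact}")
    parts.append(f"### New extract\n{new_text}")
    return "\n\n".join(parts)


# Invented example: a CV where a senior removed the commanditaire vennoten.
examples = [
    PastDossier(
        raw_text="Rechtsvorm: Commanditaire vennootschap. Beheerder: A. Beheerder.",
        final_contact={
            "legal_form": "CV",
            "directors": ["A. Beheerder"],
            "note": "commanditaire vennoten omitted",
        },
    )
]
new_text = "Rechtsvorm: Commanditaire vennootschap. Beheerder: B. Voorbeeld."
prompt = build_prompt(new_text, examples)
```

The convention is never stated as a rule; the model infers it from the difference between each example extract and its final contact.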

Takeaway

The hard part of a production extraction agent is not the model. It is the institutional convention layer. RAG is how you teach the agent your firm's conventions without writing them down.

Posting to Yuki without breaking the books

Yuki's API is older than its UI. It uses session-token authentication, returns XML, and treats most write operations as idempotent only if you supply your own ContactCode. Three rules we settled on after the first month.

We generate ContactCode from the KvK number: KVK-12345678. This makes the upsert truly idempotent across retries and makes the contact searchable in Yuki's UI without anyone learning a new ID scheme.
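The derivation is small enough to show in full. The helper name `contact_code` is ours; the `KVK-` prefix and the eight-digit rule come from the text and the schema above.

```python
import re


def contact_code(kvk_number: str) -> str:
    """Derive a deterministic Yuki ContactCode from an 8-digit KvK number."""
    digits = kvk_number.strip()
    if not re.fullmatch(r"\d{8}", digits, flags=re.ASCII):
        raise ValueError(f"not an 8-digit KvK number: {kvk_number!r}")
    return f"KVK-{digits}"


code = contact_code("12345678")
```

Because the code is a pure function of the KvK number, a retried post for the same company always targets the same contact.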

We never post inside the same request as the extraction. The extraction can take fifteen seconds. The Yuki post takes one. We queue. A failed Yuki post never retriggers the extraction.
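The separation can be sketched with an in-memory queue. This is a simplified stand-in for whatever queue the firm runs: `MAX_POST_ATTEMPTS`, the job layout, and the use of `ConnectionError` as the failure signal are assumptions, and a real worker would add a back-off between attempts.

```python
import queue

MAX_POST_ATTEMPTS = 3  # hypothetical retry budget


def drain(post_queue: "queue.Queue[dict]", post, dead_letter: list[dict]) -> int:
    """Post every queued extraction result; retry only the post, never the extraction."""
    posted = 0
    while True:
        try:
            job = post_queue.get_nowait()
        except queue.Empty:
            return posted
        try:
            post(job["contact"])
            posted += 1
        except ConnectionError:
            job["attempts"] = job.get("attempts", 0) + 1
            if job["attempts"] < MAX_POST_ATTEMPTS:
                post_queue.put(job)      # re-queue the finished extraction as-is
            else:
                dead_letter.append(job)  # give up and surface it to a human


calls: list[str] = []


def flaky_post(contact: dict) -> None:
    """Stand-in for the Yuki client: fails on the first call, then succeeds."""
    calls.append(contact["code"])
    if len(calls) == 1:
        raise ConnectionError("Yuki unreachable")


q: "queue.Queue[dict]" = queue.Queue()
q.put({"contact": {"code": "KVK-12345678"}})
failed: list[dict] = []
posted = drain(q, flaky_post, failed)
```

The job carries the already-extracted contact, so the second attempt costs one Yuki round-trip and no model call.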

We always write the original PDF to Yuki as a Document attached to the contact, even when the structured fields land cleanly. If it turns out nine months from now that the model got something subtly wrong, the source of truth is one click away. This was the one design decision the partners insisted on. They were right.

What 90 seconds actually buys

Numbers, after roughly four months in production.

Average wall-clock per dossier: 86 seconds, of which 12 are the model and the rest is OCR, retrieval, Yuki round-trip, and queue overhead.

Dossiers per week: 110, up from 95. Two reasons. First, the bottleneck was the part-timers' availability, so demand was being silently throttled by the rate at which referrals were accepted. Second, intake is no longer batched on Tuesday and Thursday afternoons.

Straight-through rate: 94%. The other 6% go to a review queue that one of the seniors clears in twenty minutes a day.

Headcount change: the four part-time data-entry roles were not refilled when the last one rotated out in April. One of them, the one who liked the work, was offered a junior bookkeeper track. She took it.

Cost of the build: low five figures. Cost of running it: roughly €60 per month in vision-API fees plus a small VPS for the worker.

These are real numbers from one firm. They are not a benchmark. A larger firm with messier intake (handwritten letters, oddly cropped phone photos of extracts) will not see the same straight-through rate without more retrieval examples and a longer human-in-the-loop period.

Where it still fails

Two failure modes we have not solved.

Phone-photographed extracts at a fifteen-degree angle and uneven lighting drop OCR confidence below where the structured extractor can recover. We currently bounce these back to the client with a one-line email asking for a flat scan. About 3% of intake.

Extracts older than 2019 use a slightly different layout for the authority section. We have eleven of these in production failure logs. The fix is to add them to the retrieval corpus. We have not done it because eleven is not yet annoying enough.

There is a third failure that is not technical. The first month, two seniors did not trust the agent. They re-typed extracts that had already been posted, into a personal spreadsheet, to compare. The fix was not better software. The fix was a weekly review meeting for the first eight weeks where the partner showed the review queue, walked through every flagged dossier, and explained the decision the agent had made. Trust is built the way it is built.

What we learned

The win at this firm was not the model. The model is a few hundred lines of prompt and a Pydantic schema. The win was the retrieval corpus (2,800 cleanly labelled past dossiers that the firm already had, in their own Yuki account), the confidence threshold (set conservatively, then loosened), and the discipline of attaching the source PDF to every contact so that nothing was ever lost.

If your firm has a queue of structured-data-entry-from-PDFs work, the five-minute audit is this: count the dossiers per week, multiply by the average minutes per dossier, and divide by 60 to get hours per week. If the answer is greater than one person's weekly working hours, the maths starts to favour an agent. If you already have a labelled archive of past work, the maths is one-sided.
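The audit arithmetic fits in one function. The dossier count and minutes per dossier below are mid-range figures from this article; the 4% typo rate in the second call is a hypothetical input, added only to show how rework changes the total.

```python
def weekly_entry_hours(
    dossiers_per_week: float,
    minutes_per_dossier: float,
    error_rate: float = 0.0,
    minutes_per_error: float = 0.0,
) -> float:
    """Hours per week spent on data entry, including rework caused by typos."""
    minutes = dossiers_per_week * (minutes_per_dossier + error_rate * minutes_per_error)
    return minutes / 60


# 95 dossiers a week at about 10 minutes each.
typing_only = weekly_entry_hours(95, 10)

# Same volume with a hypothetical 4% typo rate at forty minutes per typo.
with_rework = weekly_entry_hours(95, 10, error_rate=0.04, minutes_per_error=40)
```

Compare the result with the contracted weekly hours of the people doing the work, and remember that it counts typing time only, not the interruptions and context switches around it.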

When we built this for the Den Bosch firm, the hard part was not the extraction; it was the Yuki Domain API and the conventions only the partners knew. We approach most AI agent work the same way: by treating the model as the cheap part and the institutional knowledge as the asset.

The audit takes five minutes. Open your operations dashboard, count last month's new-client dossiers, and multiply by your true cost per minute of back-office time. The number is usually larger than people expect.

Key takeaway

The hard part of a production extraction agent is not the model. It is the retrieval corpus that encodes your firm's conventions.

FAQ

What is a KvK extract?

A PDF from the Dutch Chamber of Commerce listing a company's legal identity: KvK number, RSIN, statutory name, addresses, legal form, and authorised directors. Most arrive scanned, which makes naive extraction unreliable.

Why use RAG and not a one-shot extraction call?

Retrieval surfaces past dossiers with similar edge cases (CV vennoten, STAK foundations, foreign branches) so the model learns the firm's conventions without anyone hard-coding them in the prompt.

What happens when the model is unsure about a field?

A confidence score under 0.92 routes the dossier to a human review queue. The original PDF is attached to every Yuki contact, so any silent error is recoverable later.

How long does this take to roll out at a 10 to 30 person firm?

Four to six weeks if you already have a labelled archive of past dossier-to-system mappings to seed retrieval. Longer if you have to build the corpus from scratch.
