Scraping Dutch registries: a field guide to KvK, RDW, Kadaster
Three Dutch registries hold most of the public business data your team will ever need. The trick is knowing which to scrape, which to pay for, and which will burn your IP.

You have eight thousand KvK numbers in a spreadsheet. The sales lead wants directors, founding dates, and SBI codes by Friday. You open a browser, point a Puppeteer script at kvk.nl, set the concurrency to fifty, and at 14:32 the requests start coming back with a Cloudflare interstitial. By 14:40 the office wifi is blocked too. Welcome to scraping Dutch public registries the wrong way.
Most of the data in the Dutch public registries is, in fact, public. The trap is that public does not mean freely scrapable. Three registries hold almost everything an enrichment pipeline will ever ask for: the Kamer van Koophandel for companies, the Kadaster for real estate and land, and the RDW for vehicles. Each one has a different posture on access, and the posture is the whole story.
The free tier, registry by registry
Three quick orientations, in order of generosity.
RDW. The vehicle authority runs an open-data portal at opendata.rdw.nl on Socrata. No authentication required. The full vehicle register, minus personally identifying fields, sits behind JSON, CSV, and SoQL endpoints. Anonymous limits are around a thousand requests per rolling hour; with a free app token you move into the millions. License-plate to make, model, year, and APK history is a single GET away. If every Dutch dataset looked like this, none of this article would be necessary.
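As a rough illustration of that single GET, the sketch below builds a SoQL query against the vehicle dataset used in the curl example at the end of this article (m9d7-ebf2). The column names (kenteken, merk, handelsbenaming, datum_eerste_toelating, vervaldatum_apk) and the X-App-Token header follow common Socrata conventions but are assumptions here; check them against the dataset's column list before relying on them. The helper names are illustrative.

```python
import re
from urllib.parse import urlencode

# Dataset id taken from this article's curl example; column names are assumptions.
RDW_VEHICLES = "https://opendata.rdw.nl/resource/m9d7-ebf2.json"
FIELDS = ("kenteken", "merk", "handelsbenaming", "datum_eerste_toelating", "vervaldatum_apk")


def normalize_plate(raw: str) -> str:
    """Strip dashes and spaces and upper-case the plate."""
    plate = re.sub(r"[^0-9A-Za-z]", "", raw).upper()
    if not 1 <= len(plate) <= 8:
        raise ValueError(f"implausible plate: {raw!r}")
    return plate


def build_vehicle_query(raw_plate: str) -> str:
    """Return the GET URL for one plate, selecting only the columns we need."""
    params = {
        "$select": ",".join(FIELDS),
        "kenteken": normalize_plate(raw_plate),
        "$limit": 1,
    }
    return f"{RDW_VEHICLES}?{urlencode(params)}"


def auth_headers(app_token=None):
    """Optional Socrata app token; anonymous access works without it."""
    return {"X-App-Token": app_token} if app_token else {}


if __name__ == "__main__":
    print(build_vehicle_query("ab-12-cd"))
```

Selecting only the columns you need keeps responses small, which is the polite thing to do even on a registry that does not ask you to.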
Kadaster. The land registry splits its world in two. The free half lives on PDOK (Publieke Dienstverlening Op de Kaart): BAG addresses and buildings, BRT and BGT topography, AHN elevation, several other base registers, all served as WFS, WMS, and Atom feeds under explicit reuse licences. The paid half is the cadastral record itself: who owns parcel X, how it is mortgaged, what changed hands and for how much. That sits behind Mijn Kadaster and is billed per query. Scraping the second half is not a technical question. It is the wrong question.
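To make the free half concrete, here is a hedged sketch of a WFS 2.0 GetFeature request. The query parameters are standard OGC WFS; the base URL, the JSON output format, and the bag:pand layer name are assumptions about PDOK's current BAG service and should be confirmed against the service's GetCapabilities response before use.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify it and the layer names via GetCapabilities.
BAG_WFS = "https://service.pdok.nl/lv/bag/wfs/v2_0"


def capabilities_url(base: str = BAG_WFS) -> str:
    """URL of the GetCapabilities document, which lists the real layer names."""
    return f"{base}?{urlencode({'service': 'WFS', 'request': 'GetCapabilities'})}"


def get_feature_url(type_name: str, count: int = 10, base: str = BAG_WFS) -> str:
    """Build a GetFeature URL that asks for at most `count` features."""
    if count < 1:
        raise ValueError("count must be positive")
    params = {
        "service": "WFS",
        "version": "2.0.0",
        "request": "GetFeature",
        "typeNames": type_name,
        "count": count,
        "outputFormat": "application/json",  # assumed to be offered by the service
    }
    return f"{base}?{urlencode(params)}"


if __name__ == "__main__":
    print(capabilities_url())
    print(get_feature_url("bag:pand", count=5))
```

For anything beyond a sample, the Atom feeds are the better fit: download the extract once instead of paging through WFS.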
KvK. The Chamber of Commerce is the awkward one. There is a developer portal at developers.kvk.nl with a Search API and a Profile API. The test tier returns dummy data. The live tier bills per call. Bulk extracts via the Handelsregister Dataservice are a separate contract. The public website at kvk.nl sits behind Cloudflare and starts serving interstitials after a handful of requests from the same origin. If you want clean records at scale, the API is the only road. The scraper is a way to get yourself banned and your client a complaint.
The legal floor most playbooks skip
Three texts belong on the desk before any registry work begins.
The first is the Algemene Verordening Gegevensbescherming (AVG), the Dutch name for the GDPR, which applies directly in the Netherlands. Records of sole proprietors and named directors are personal data. B2B enrichment can usually rest on legitimate interest, but the legitimate interest assessment has to be written down before you start, not improvised after a complaint lands. The Autoriteit Persoonsgegevens has fined organisations for exactly this pattern, and a one-page LIA noting purpose, necessity, balance, and safeguards is the cheapest insurance you will ever buy.
The second is the Databankenwet, which implements EU Directive 96/9/EC. Anyone who makes a substantial investment in obtaining, verifying, or presenting the contents of a database holds a sui generis right against substantial extraction or re-use, even when the underlying data is technically public. The case law to read is Innoweb v Wegener, where a dedicated metasearch engine querying a car-ads site was held to re-utilise the site's database, and Ryanair v PR Aviation, where the Court of Justice held that a database protected by neither copyright nor the sui generis right can still be fenced off by contractual terms of use. Together they make scraping a published database a high-variance bet.
The third is article 138ab of the Wetboek van Strafrecht, computervredebreuk. Hitting an open URL is fine. Solving a captcha that the registry put in front of you, bypassing a login, or rotating residential proxies to evade an IP ban is at least arguable as circumventing a technical security measure. That is a criminal-law conversation, not a polite engineering one.
The moment a registry shows you a Cloudflare challenge or a captcha and you decide to defeat it, you have moved from scraping public data to a position you do not want to defend in front of a public prosecutor.
The twenty-minute IP burn
Modern edge protection is not looking at the URL. Cloudflare, Akamai, and the rest fingerprint the TLS handshake (JA3, JA4), the HTTP/2 frame order, the header order, the missing Sec-Fetch-* headers, and the request rate per ASN. A vanilla Python requests session is identifiable in one packet, and fifty parallel workers from a single office IP look exactly like what they are.
The defensive recipe is boring and works. Identify yourself in the User-Agent with a real product name, a real URL, and a real contact address. Run one concurrent connection per host. Sleep between requests with jitter; one to two seconds is a reasonable default. Honour Retry-After. If the server says sixty, sleep sixty. Send conditional GETs with If-Modified-Since and If-None-Match (the request-side partner of the ETag response header), because an unchanged resource then comes back as 304 Not Modified and costs the server almost nothing. Cache to disk before any business logic touches the response. You will re-run, more often than you think.
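Honouring Retry-After has one wrinkle: the header may carry either a number of seconds or an HTTP-date. A minimal sketch of a parser that handles both, using only the standard library; the function name, the sixty-second default, and the one-hour cap are illustrative choices, not values published by any registry.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None, default=60.0, cap=3600.0):
    """Convert a Retry-After header value to a non-negative sleep in seconds.

    The header is either delta-seconds ("120") or an HTTP-date.
    """
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        return min(float(value), cap)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default  # unparseable header: fall back to a safe pause
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now"; never return a negative sleep.
    return min(max((when - now).total_seconds(), 0.0), cap)
```

The cap guards against a misconfigured server asking a batch job to sleep for a day; whether to respect such a value or abort the run is a policy decision the code should not make silently.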
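A polite-by-default scraper in under forty lines
# Requires: pip install "httpx[http2]" tenacity
import random
import time

import httpx
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

UA = "ABN-Enrichment/1.0 (https://abn.company; info@abn.company)"


class RateLimited(Exception):
    """Raised on HTTP 429 so tenacity retries after the server's requested pause."""


@retry(
    retry=retry_if_exception_type((httpx.HTTPError, RateLimited)),
    wait=wait_exponential(multiplier=2, min=4, max=120),
    stop=stop_after_attempt(5),
)
def fetch(client, url, etag=None):
    headers = {"User-Agent": UA}
    if etag:
        headers["If-None-Match"] = etag  # conditional GET
    r = client.get(url, headers=headers, timeout=20)
    if r.status_code == 429:
        # Honour Retry-After when it is given in seconds; otherwise wait a minute.
        retry_after = r.headers.get("Retry-After", "")
        time.sleep(int(retry_after) if retry_after.isdigit() else 60)
        raise RateLimited()
    if r.status_code == 304:
        return r  # unchanged; httpx's raise_for_status() would treat 304 as an error
    r.raise_for_status()
    return r


def crawl(urls, store):
    # store is any object with get_etag(url) and put(url, body, etag)
    with httpx.Client(http2=True) as client:  # one client, one request at a time
        for url in urls:
            cached_etag = store.get_etag(url)
            r = fetch(client, url, etag=cached_etag)
            if r.status_code != 304:
                store.put(url, r.text, r.headers.get("ETag"))  # cache before any parsing
            time.sleep(1.2 + random.random() * 0.8)  # 1.2 to 2.0 s of jitter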
Nothing clever. One worker per host, jitter, exponential backoff, identification, conditional GET. This pattern outlives every Cloudflare ruleset change because it is not adversarial. It looks like what it is: a small program reading public data slowly.
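The listing calls store.get_etag and store.put without defining them. One possible shape for that store, sketched with the standard library's sqlite3 module; the class name PageStore, the table layout, and the extra get_body helper are hypothetical, and any object with the same two methods would do.

```python
import sqlite3
import time


class PageStore:
    """Tiny on-disk cache: one row per URL with body, ETag and fetch time."""

    def __init__(self, path="pages.sqlite3"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            "url TEXT PRIMARY KEY, body TEXT NOT NULL, etag TEXT, fetched_at REAL NOT NULL)"
        )
        self.db.commit()

    def get_etag(self, url):
        """Return the stored ETag for url, or None if there is none."""
        row = self.db.execute("SELECT etag FROM pages WHERE url = ?", (url,)).fetchone()
        return row[0] if row else None

    def get_body(self, url):
        """Return the cached body for url, or None if it was never fetched."""
        row = self.db.execute("SELECT body FROM pages WHERE url = ?", (url,)).fetchone()
        return row[0] if row else None

    def put(self, url, body, etag):
        """Insert or overwrite the cached copy and commit immediately."""
        self.db.execute(
            "INSERT OR REPLACE INTO pages (url, body, etag, fetched_at) VALUES (?, ?, ?, ?)",
            (url, body, etag, time.time()),
        )
        self.db.commit()
```

Committing on every put is slower than batching, but at one request every second or two the database is never the bottleneck, and a crash mid-run loses nothing.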
The API-versus-scraper breakpoint
The mental model most operators bring is that scraping is free and the API is expensive, so we scrape. Run the arithmetic on a real job.
The KvK Profile API on the live tier costs roughly two euro cents per call under most contracts. Eight thousand lookups at that rate come to one hundred and sixty euros. Building, maintaining, and re-unblocking a scraper that survives the next Cloudflare rule change takes at least one engineer-day, at an honest internal rate of six hundred to one thousand euros. The scraper wins only on the API invoice. It loses on total cost, time-to-result, legal posture, and the second time you need to run the enrichment.
The breakpoint is simple. If you will run the enrichment more than twice, or if the result feeds anything customer-facing, the API is cheaper. Per-call infrastructure is cheap relative to engineering time, and any finance team that has had to expense the second emergency scraper rewrite will tell you the same thing. Buy the call, not the rework.
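The arithmetic is small enough to keep in a script next to the job. The sketch below uses the figures quoted above; the assumption that every scraper run costs one full engineer-day of building or repair is mine, and the names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    lookups: int
    api_cents_per_call: int      # about 2 per the figures above; contract-dependent
    engineer_day_eur: int        # 600 to 1000 per the figures above
    scraper_days_per_run: float  # assumption: one day to build or repair per run


def api_cost_eur(job: Job, runs: int = 1) -> float:
    """API spend if every run pays for every lookup (no caching)."""
    return job.lookups * job.api_cents_per_call * runs / 100


def scraper_cost_eur(job: Job, runs: int = 1) -> float:
    """Engineering spend if every run needs a build or a repair."""
    return job.engineer_day_eur * job.scraper_days_per_run * runs


job = Job(lookups=8000, api_cents_per_call=2, engineer_day_eur=600, scraper_days_per_run=1.0)
print(api_cost_eur(job))      # 160.0
print(scraper_cost_eur(job))  # 600.0
```

Put differently, at two cents a call one engineer-day at six hundred euros buys thirty thousand lookups.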
Containment for agents that scrape
If the thing calling the registry is an AI agent, do not point it at the open internet. Put a mediating service between the agent and the registry. The agent calls a local /lookup?kvk=... endpoint. The mediator applies the rate limit, the on-disk cache, the allow-list of which paths are reachable, and the masking of personal fields demanded by the legitimate interest assessment. The agent never sees a 429, never sees a captcha, and never decides on its own to rotate proxies.
The principle is older than the word agent: the tenant cannot exceed what the landlord lets through. The agent gets a typed interface; the mediator enforces rate, scope, and field-level redaction. When a registry changes its rate posture, you change the mediator and ship a single redeploy. The agent code does not move.
Treat the registry as a network resource the agent does not own. The rate limit, the cache, and the legal posture belong to a service in front of the agent. The agent only ever sees clean JSON.
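A minimal in-process sketch of such a mediator, assuming the registry call is injected as a plain function. The class name, the error shapes, and the personal-field names are hypothetical; the real field list comes from your LIA and the API schema. The only fact baked in is that a KvK number is eight digits.

```python
import re
import time
from typing import Callable

KVK_RE = re.compile(r"\d{8}")  # KvK numbers are eight digits
# Hypothetical field names; take the real list from your LIA and the API schema.
PERSONAL_FIELDS = {"directors", "owner_name", "private_address"}


class Mediator:
    """Sits between the agent and the registry: validates, throttles, redacts."""

    def __init__(self, fetch: Callable[[str], dict], min_interval: float = 1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._fetch = fetch  # the only code that talks to the registry
        self._min_interval = min_interval
        self._clock, self._sleep = clock, sleep
        self._last_call = None

    def lookup(self, kvk: str) -> dict:
        if not KVK_RE.fullmatch(kvk):
            # Scope check: the agent can only ask for a well-formed KvK number.
            return {"error": "invalid_kvk_number"}
        self._throttle()
        try:
            record = self._fetch(kvk)
        except Exception:
            # The agent never sees a 429 or a captcha, only a typed failure.
            return {"error": "registry_unavailable"}
        return {k: v for k, v in record.items() if k not in PERSONAL_FIELDS}

    def _throttle(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last_call is not None:
            wait = self._min_interval - (now - self._last_call)
            if wait > 0:
                self._sleep(wait)
        self._last_call = self._clock()
```

In production this would sit behind the /lookup endpoint as its own service; keeping the clock and sleep injectable makes the rate limit testable without waiting.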
One pattern from a real build
We built an enrichment agent for a Dutch search agency this spring. The thing we ran into was that the KvK Search API returns partial company records for free-text queries; trade names, board composition, and SBI detail only show up if you make a follow-up Profile call per match, and Profile calls are billed. We solved it by caching every Profile result for ninety days against a per-KvK-number key, so the agent paid the registry about four times a year per company instead of on every Slack message. The same shape applies to most AI agents that touch external registries: bound the call site, cache by a stable key, let the agent ask twice for free.
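The caching rule can be sketched in a few lines. This is not the code from that build, only an in-memory illustration of the same idea; fetch_profile stands in for whatever function wraps the billed Profile call, and a real deployment would persist the entries to disk.

```python
import time
from typing import Callable

NINETY_DAYS = 90 * 24 * 3600  # seconds


class ProfileCache:
    """Cache billed Profile lookups per KvK number for a fixed time-to-live."""

    def __init__(self, fetch_profile: Callable[[str], dict], ttl: float = NINETY_DAYS,
                 clock: Callable[[], float] = time.time):
        self._fetch_profile = fetch_profile  # hypothetical: wraps the paid API call
        self._ttl = ttl
        self._clock = clock
        self._entries = {}  # kvk number -> (fetched_at, profile)
        self.billed_calls = 0

    def get(self, kvk: str) -> dict:
        now = self._clock()
        hit = self._entries.get(kvk)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # fresh: free for the agent
        profile = self._fetch_profile(kvk)  # stale or missing: one billed call
        self.billed_calls += 1
        self._entries[kvk] = (now, profile)
        return profile
```

Counting billed_calls in the cache itself gives finance a number to reconcile against the registry invoice.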
Smallest thing you can do today: open a terminal and run curl -I https://opendata.rdw.nl/resource/m9d7-ebf2.json. Read the response headers. That is what a politely scrapable registry looks like, and it is the benchmark every other dataset on your enrichment list should be measured against.
Key takeaway
Public Dutch registry data falls into three tiers: open, paid-API, and legally toxic to scrape. Know which tier you are in before writing a single request.
FAQ
Is it legal to scrape the public KvK website?
Open URLs are reachable, but the KvK terms of service prohibit redistribution and the Databankenwet protects against substantial extraction. For B2B enrichment at scale, the paid Search or Profile API is the only defensible route.
What is the cheapest legitimate way to enrich a list of KvK numbers?
The KvK Profile API on the live tier, cached for ninety days against the KvK number. Most company records are stable for months and a per-call cost around two cents amortises into noise once you reuse the cache.
Why does Cloudflare block my Python script almost immediately?
Edge protection fingerprints the TLS handshake, header order, and request rate, not the URL. A vanilla requests session is identifiable in one packet. Switch to httpx with http2, identify yourself, and slow to one request per second.
Do I need a legitimate interest assessment for B2B scraping?
Yes, if any record contains personal data such as sole-proprietor or director names. A one-page LIA covering purpose, necessity, balance, and safeguards, signed and dated before scraping, is what the Autoriteit Persoonsgegevens expects.