Data scraping rebuilt: 17 supplier portals on Playwright

A 44-person Rotterdam food importer ran a Selenium scraper that broke every other week. We rebuilt it on Playwright plus an LLM extractor. The buyers stopped chasing prices at 7am.

Jacob Molkenboer · Founder · A Brand New Company · 9 Jun 2026 · 9 min

The 6:47 alert

Marije runs operations at a 44-person wholesale food importer in Rotterdam. She is 36 minutes into her Tuesday when her phone vibrates. The overnight scraper, a Selenium job that pulls daily price sheets from seventeen supplier portals, has failed on three of them. One of the failed ones is the largest supplier of fresh fish. Her buyers start at seven. The phone calls to restaurants in the Randstad start at half past.

She knows what to do. She has done it twenty times this year. Open laptop, log in to each portal by hand, copy the price columns into a shared spreadsheet, paste the spreadsheet into the ERP. It takes about thirty minutes per portal if nothing weird happens. Three portals means a buyer is going to do something else first.

That was the state of the scraper in February. The brief from her CFO in March: we do not want a human in this loop, and we do not want the wrong prices either.

The original Selenium stack

The previous build was solid for its era. Python, Selenium 4, a headless Chrome with a custom Dockerfile, persistent cookies in a volume mount, a cron at 03:00 Amsterdam. Each portal had a hand-written scraper, on average around 180 lines, which clicked through the login, the cookie banner, the "ja, ik wil verder" interstitial, the date picker, then read prices out of a table or downloaded a CSV.

It worked, until it did not. Selectors were the first thing to rot. Then portals started adding Cloudflare Turnstile in front of the login. Two of the larger suppliers redesigned. A regional one moved its price sheet behind a PDF generator instead of an HTML table.

Around two portals broke every month. Each break ate roughly an hour and a half of dev time to patch, and an unknown amount of buyer time while the patch was being written. We had the count for the previous twelve months: 27 incidents, 41 dev hours, an estimated 60 buyer hours.

The rebuild brief

The CFO did not ask for a rewrite. She asked for two things: fewer 6:47 phone calls, and no surprise margin loss because of a misread decimal. We turned that into three rules.

Survive portal redesigns without a code change, at least most of the time.
Fail loud when prices look wrong. Never post a price the system is not confident in.
Keep an audit trail of every value that hit the ERP, with the screenshot and the raw extraction.

Those three sound like product copy, but they decided the architecture.

Playwright as the browser layer

We replaced Selenium with Playwright. The reasons are well-rehearsed by now. Auto-waiting removes the most common class of "it works locally" bug. The trace viewer makes intermittent failures debuggable instead of mysterious. Storage state can be serialized per portal, which means login flows run once a week instead of once a night, which means a lot fewer CAPTCHAs.
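The once-a-week login can be sketched with Playwright's two storage-state calls: `context.storageState({ path })` saves cookies and local storage to a file, and `browser.newContext({ storageState })` restores them. Everything else below is an assumption for illustration: the `openPortalContext` and `sessionIsFresh` names, the seven-day cutoff, and the small `BrowserLike` and `ContextLike` interfaces, which only mirror those two calls so the sketch type-checks without importing the library.

```typescript
// Minimal structural types for the two Playwright calls used below.
// A real Browser / BrowserContext from 'playwright' satisfies both.
interface ContextLike {
  storageState(options: { path: string }): Promise<unknown>;
}
interface BrowserLike {
  newContext(options?: { storageState?: string }): Promise<ContextLike>;
}

// Assumed cutoff: sessions older than a week are logged in again.
const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

// savedAtMs is when the state file was last written (null if it does not exist).
function sessionIsFresh(savedAtMs: number | null, nowMs: number, maxAgeMs = WEEK_MS): boolean {
  return savedAtMs !== null && nowMs - savedAtMs < maxAgeMs;
}

// Reuse the saved session when it is fresh; otherwise log in once and save it.
async function openPortalContext(
  browser: BrowserLike,
  statePath: string,
  savedAtMs: number | null,
  login: (ctx: ContextLike) => Promise<void>,
  nowMs: number = Date.now(),
): Promise<ContextLike> {
  if (sessionIsFresh(savedAtMs, nowMs)) {
    return browser.newContext({ storageState: statePath });
  }
  const ctx = await browser.newContext();
  await login(ctx); // the portal-specific login steps
  await ctx.storageState({ path: statePath }); // persist for the next nightly runs
  return ctx;
}
```

The caller supplies `savedAtMs`, for example from the state file's modification time, which keeps the freshness decision a pure function that is easy to test.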

The interesting choice was not Playwright over Selenium, though. It was treating each portal as a thin "navigate to the price page" function and nothing more. No table parsing, no CSV cleaning, no "find the column with the kg in the header." All of that moved into the next stage.

Here is the shape of a portal driver in the new system:

// portals/koelhandel-westland.js
export const portal = {
  id: 'koelhandel-westland',
  schedule: '0 3 * * 1-5',
  async goToPriceSheet(page) {
    await page.goto('https://portaal.koelhandel-westland.nl/login');
    await page.getByLabel('Klantnummer').fill(process.env.KW_USER);
    await page.getByLabel('Wachtwoord').fill(process.env.KW_PASS);
    await page.getByRole('button', { name: 'Inloggen' }).click();
    await page.getByRole('link', { name: /dagprijzen/i }).click();
    await page.waitForLoadState('networkidle');
    return {
      kind: 'html',
      html: await page.content(),
      screenshot: await page.screenshot({ fullPage: true }),
    };
  },
};

Twenty to forty lines per portal instead of a hundred and eighty. When the price page format changes, we do not touch this file. The driver is allowed to be dumb. It is the extractor's job to deal with the layout.

The LLM extractor

This is where the work moved to. The extractor takes one of three inputs (rendered HTML, a PDF byte stream, or an Excel file) and returns a strict JSON object that matches a schema. The schema is the contract between this stage and everything downstream.

// extractor/schema.ts
export const PriceSheet = {
  type: 'object',
  required: ['portal_id', 'sheet_date', 'currency', 'lines'],
  properties: {
    portal_id: { type: 'string' },
    sheet_date: { type: 'string', format: 'date' },
    currency: { type: 'string', enum: ['EUR', 'USD'] },
    lines: {
      type: 'array',
      items: {
        type: 'object',
        required: ['supplier_sku', 'description', 'price', 'unit'],
        properties: {
          supplier_sku: { type: 'string' },
          description: { type: 'string' },
          price: { type: 'number' },
          unit: { type: 'string', enum: ['kg', '100kg', 'piece', 'crate'] },
          vat_included: { type: 'boolean' },
          notes: { type: 'string' },
        },
      },
    },
  },
};

We use a model with native structured output (the schema is enforced at the API layer, not by a post-processor). The prompt itself is shorter than people expect, because the schema does most of the work. Roughly:

Extract every product line from the attached price sheet. Use the schema. Read units carefully. If a price is per 100 kg, set unit to 100kg and do not divide. Do not normalize, do not convert. If a row is a header, footer, total, or commentary, skip it.

The "do not normalize" instruction matters. The model is good at reading. It is unreliable at unit conversion when a sheet mixes units across rows. Conversion belongs in code, where it is testable.

Normalization in code, not in prompt

Once the JSON is back, a deterministic pass runs:

Convert every price to EUR per kilogram. The rules are read from a YAML per portal so a buyer can correct them without a deploy.
Strip VAT if the portal includes it (we keep ERP prices net of VAT).
Map supplier_sku to our internal SKU via a lookup table. Unknown SKUs do not silently become "new products." They go to a queue.
Validate that the price falls inside a band based on the trailing 30-day median for that SKU. Default band is plus or minus 35 percent, tightened per category (fresh fish is wider than canned tomatoes).

// pipeline/normalize.ts
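function toEurPerKg(line: ExtractedLine, rules: PortalRules): number {
  let priceEur = line.price;
  // Convert to EUR first, then strip VAT, so the unit division sees a net EUR price.
  if (rules.currency !== 'EUR') priceEur = fx(rules.currency, 'EUR', priceEur);
  if (line.vat_included) priceEur = priceEur / (1 + rules.vatRate);
  // Weight in kg that one priced unit represents. The switch keeps the SKU
  // weight lookups lazy: they only run for piece- and crate-priced lines.
  let unitWeightKg: number;
  switch (line.unit) {
    case 'kg': unitWeightKg = 1; break;
    case '100kg': unitWeightKg = 100; break;
    case 'piece': unitWeightKg = rules.pieceWeightKg(line.supplier_sku); break;
    case 'crate': unitWeightKg = rules.crateWeightKg(line.supplier_sku); break;
    default: throw new Error(`Unknown unit: ${line.unit}`); // fail loud on anything outside the schema
  }
  return Number((priceEur / unitWeightKg).toFixed(4));
}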

The unit test suite for this file is bigger than the entire scraper from 2022.
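The band check from the last step of the list above is small enough to sketch in full. This is a minimal version, assuming the caller already has the trailing 30 days of normalized EUR/kg prices for the SKU. The names `checkBand`, `median`, and `BandVerdict` are illustrative, not the production code.

```typescript
// Illustrative sketch: names and shapes are assumptions.
type BandReason = 'in-band' | 'out-of-band' | 'no-history' | 'invalid-price';

interface BandVerdict {
  ok: boolean;
  reason: BandReason;
  median: number | null;
}

// Median of a non-empty list; the input array is not mutated.
function median(values: number[]): number {
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// bandFraction 0.35 means plus or minus 35 percent around the median.
function checkBand(
  priceEurPerKg: number,
  trailing30d: number[],
  bandFraction = 0.35,
): BandVerdict {
  // A price that is not a finite number never posts.
  if (!isFinite(priceEurPerKg)) {
    return { ok: false, reason: 'invalid-price', median: null };
  }
  // Without history there is nothing to compare against, so a human decides.
  if (trailing30d.length === 0) {
    return { ok: false, reason: 'no-history', median: null };
  }
  const m = median(trailing30d);
  const deviation = Math.abs(priceEurPerKg - m) / m;
  return deviation <= bandFraction
    ? { ok: true, reason: 'in-band', median: m }
    : { ok: false, reason: 'out-of-band', median: m };
}
```

A EUR/100kg price read as EUR/kg lands about a hundred times above the median, so it fails this check at any sensible band width. A per-category band would pass a different `bandFraction` for, say, fresh fish than for canned tomatoes.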

Posting to the ERP

Their ERP exposes a REST API for supplier price list entries. We post in batches per portal, one transaction per sheet, with the screenshot stored in object storage and its URL attached to the row. Every posted line carries three identifiers: the portal id, the extraction run id, and the SHA-256 of the raw input. If anything ever needs to be reproduced, it can be.

await erp.priceLists.create({
  supplierId: portal.erpSupplierId,
  validFrom: sheet.sheet_date,
  evidenceUrl: screenshotUrl,
  extractionRunId: run.id,
  lines: lines.map((l) => ({
    sku: l.internal_sku,
    pricePerKgEur: l.normalized_price,
    sourceHash: l.source_hash,
  })),
});

The thing that matters is the evidenceUrl. When a buyer queries a price, she can click through to the actual screenshot of the actual portal page on the actual day. We learned the hard way that "the AI extracted it" is not an answer anyone in operations will accept on its own.
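The source hash that travels with each posted line is cheap to compute. A minimal sketch using Node's built-in `crypto` module; the `sourceHash` and `provenanceFor` helpers and the `Provenance` shape are hypothetical names chosen for this example.

```typescript
import { createHash } from 'node:crypto';

// The three identifiers the text attaches to every posted line.
interface Provenance {
  portalId: string;
  extractionRunId: string;
  sourceHash: string; // hex-encoded SHA-256 of the raw input
}

// Hash the raw input exactly as the driver captured it:
// an HTML string, or the bytes of a PDF or Excel file.
function sourceHash(raw: string | Uint8Array): string {
  return createHash('sha256').update(raw).digest('hex');
}

function provenanceFor(
  portalId: string,
  extractionRunId: string,
  raw: string | Uint8Array,
): Provenance {
  return { portalId, extractionRunId, sourceHash: sourceHash(raw) };
}
```

Hashing the input before any cleaning means a stored raw file can later be re-hashed and matched against the ERP row it produced.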

Drift detection and the queue

The two failure modes we cared about were silent wrongs and loud nothings. Silent wrongs (a decimal misread, a kilogram-vs-100kg confusion, a stale page served from a cache) get caught by the band check. Loud nothings (a portal redesign that the driver can no longer navigate) get caught by Playwright timing out on the navigation.

Both end up in the same queue. A single Slack channel, one message per portal per day, either a green check or a yellow flag with the screenshot and the raw extraction inline. Marije sees about two flags a week. Most resolve in under five minutes (she clicks "accept" and the price posts, or she clicks "investigate" and a buyer takes over).
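The one-message-per-portal-per-day rule can be sketched as a pure function that turns a run's flags into either a green or a yellow message. This is an assumption-heavy illustration: the `Flag` kinds mirror the two failure modes above, but the names, the message wording, and the Slack emoji codes are made up for the example, and actually sending the message is left out.

```typescript
// Illustrative sketch: every name and string here is an assumption.
type Flag =
  | { kind: 'navigation-failed'; detail: string }
  | { kind: 'out-of-band'; sku: string; priceEurPerKg: number; medianEurPerKg: number };

interface DailyMessage {
  portalId: string;
  state: 'green' | 'yellow';
  text: string;
  evidenceUrl: string | null; // screenshot, when the driver got far enough to take one
}

function describeFlag(flag: Flag): string {
  switch (flag.kind) {
    case 'navigation-failed':
      return `navigation failed: ${flag.detail}`;
    case 'out-of-band':
      return `${flag.sku} at ${flag.priceEurPerKg.toFixed(2)} EUR/kg vs 30-day median ${flag.medianEurPerKg.toFixed(2)}`;
  }
}

// No flags: green check. Any flag: one yellow message listing all of them.
function dailyMessage(portalId: string, flags: Flag[], evidenceUrl: string | null): DailyMessage {
  if (flags.length === 0) {
    return { portalId, state: 'green', text: `:white_check_mark: ${portalId}: all lines posted`, evidenceUrl };
  }
  const header = `:warning: ${portalId}: ${flags.length} flag(s) need review`;
  const details = flags.map((f) => `- ${describeFlag(f)}`);
  return { portalId, state: 'yellow', text: [header].concat(details).join('\n'), evidenceUrl };
}
```

Keeping the message builder free of network calls makes the green-versus-yellow decision easy to unit test alongside the normalization rules.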

Takeaway

The model does extraction. Code does normalization. A human does adjudication. Each layer is allowed to be honestly bad at the other two.

Ninety days later

Some numbers from the first three months of operation, measured against the previous twelve months of the Selenium era.

Portals running unattended: 17, up from 11.
Incidents requiring dev work: 4, down from a run-rate of around 7 for an equivalent quarter.
Portal redesigns survived without a code change: 3 of the 4 that occurred. The fourth (a login flow that moved to SMS-OTP) needed two hours.
Average daily extraction time: 14 minutes for all portals in parallel.
Buyer time spent typing prices: zero, down from an estimated 60 hours per year.
Misposted prices: zero detected. We had one near-miss that the band check caught (a portal briefly served a EUR/100kg sheet with the EUR/kg header, the band flagged it, Marije confirmed by phone with the supplier).

The thing that surprised us, looking back, was how much of the value sat in the queue design rather than the model. The extractor is doing a job that is, frankly, easy for it. The hard work was deciding what to do when it was wrong, and making "wrong" cheap.

Warning

If you build this, do not let the model normalize. Ask it to read, not to convert. Every unit conversion in a prompt is a bug waiting for the Monday after a long weekend.

What we would do differently

If we did the rebuild over, two things would change.

First, we would write the band-check rules per category from day one instead of starting with a global plus or minus 35 percent and tightening later. The global default let two minor misreads through in week two. Nothing material, but it eroded trust for a fortnight.

Second, we would not spend the first six weeks building our own retry logic on top of Playwright. Playwright's trace viewer plus a five-line retry-with-backoff is enough. We over-engineered, then deleted, an entire failure taxonomy. The lesson, written down so we stop relearning it: the model is the cheap part of a pipeline like this, and the plumbing around it is where the budget goes.
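For scale, a retry-with-backoff of the kind described really is short. A minimal sketch, assuming exponential backoff and three attempts; the `withRetry` name, the defaults, and the injectable `sleep` (there only so tests do not have to wait) are choices made for this example.

```typescript
// Retry an async task with exponential backoff: baseDelayMs, then 2x, 4x, ...
// Rethrows the last error once the attempts are used up.
async function withRetry<T>(
  task: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 1000,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => { setTimeout(resolve, ms); }),
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      if (attempt >= attempts) throw err; // out of attempts: fail loud
      await sleep(baseDelayMs * Math.pow(2, attempt - 1));
    }
  }
}

// Hypothetical usage with a portal driver:
//   const sheet = await withRetry(() => portal.goToPriceSheet(page));
```

When the final attempt fails, the error propagates to the caller, which is the point where a portal would be flagged rather than retried again.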

When we built the extraction and automation pipeline for the Rotterdam importer, the hardest thing we ran into was that every supplier portal expressed VAT differently and the buyers had never written the rules down. We solved it by sitting with two buyers for an afternoon and turning their tacit knowledge into the YAML files that drive normalization. The scraper was the easy half.

The smallest thing you could do today: open your own brittle scraper, find the file with the most if portal == ... branches in it, and ask whether the layout knowledge could move into a schema-bound model call and out of your code. If yes, the rest of the rebuild follows from that one decision.

Key takeaway

The model reads, code converts, a human adjudicates. Make wrong cheap and the rest of the pipeline follows from that one decision.

FAQ

Why Playwright instead of Selenium for this kind of work?

Auto-waiting removes most "it works locally" bugs, the trace viewer makes intermittent failures debuggable, and storage state lets you skip nightly re-logins. Selenium can do these too, with more code.

Why not let the model do the unit conversion as well?

Models are unreliable at arithmetic when a sheet mixes EUR/kg and EUR/100kg rows. We have the model read and have deterministic code convert. The conversion layer is the most tested part of the pipeline.

How do you stop a wrong price from reaching the ERP?

Before posting, every line is checked against a band around the trailing 30-day median for its SKU, with the band width set per category. Anything outside the band goes to a Slack queue with the screenshot and raw extraction. A human approves or investigates.

What happens when a supplier portal is fully redesigned?

Most redesigns survive because the layout knowledge lives in the extractor, not the code. When the login or navigation flow changes, the Playwright driver needs editing. The one time that happened in the first ninety days, the fix took two hours.
