Playwright vs Crawlee vs Workers: a tariff monitor test

A Rotterdam forwarder needed daily tariff snapshots from 38 carrier portals. We built the scraper three different ways before we landed on a stack we trust.

Jacob Molkenboer · Founder · A Brand New Company · 22 Aug 2024 · 6 min

At 05:48 on a Tuesday in March, the operations lead at a mid-size Rotterdam forwarder was on her ninth carrier portal of the morning. Maersk, MSC, CMA CGM, Hapag-Lloyd: all open in separate tabs, all needing the same six numbers copied into the same spreadsheet before her 06:30 quote deadline. She had been doing this for nine years. We were brought in to make it stop.

The brief was straightforward. Scrape 1,400 carrier pages a day across 38 portals, push the deltas to a Slack channel and a Postgres table, and never miss a Monday when the spot market moves. The interesting part was the spread. Eighteen of those portals were JavaScript-rendered behind a soft login. Twenty were plain PHP pages last redesigned in 2014.

We built it three ways before we shipped.

Round one: Playwright everywhere

The first instinct was a single fleet of Playwright workers. One framework, one mental model, every carrier handled the same way. We spun up four EC2 t3.large boxes with a Postgres queue, a small Node orchestrator, and headless Chromium. It worked on day one. By day three we had problems.

Memory was the first wall. Each Chromium context held 180 to 260 MB, and rotating contexts to dodge fingerprinting cost us minutes per carrier. The static HTML pages took 4 to 7 seconds each because we were paying full browser-boot tax on a page that would have parsed in 80 ms with fetch and cheerio. Compute bills ran around €310 a month before we had added retries or a real proxy.

Playwright was right for the eighteen hard portals. It was theatrical overkill for the other twenty.

Round two: Crawlee in the middle

Crawlee is Apify's open-source crawling framework, and the thing it gets right is that you do not have to pick a strategy upfront. You can mix a CheerioCrawler for the static pages and a PlaywrightCrawler for the rendered ones, share a request queue between them, and reuse the same retry, proxy, and session-pool logic across both.
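The routing decision itself is small. A minimal sketch, assuming each target carries the same `rendered` flag that the carriers table uses later in this article; the function name and the sample URLs are hypothetical. In a Crawlee project the two lists would feed a CheerioCrawler and a PlaywrightCrawler respectively.

```javascript
// Split targets into the cheap path (plain HTTP plus an HTML parser) and the
// expensive path (headless browser). `rendered` is assumed to be 0 or 1.
function routeTargets(targets) {
  const routes = { static: [], rendered: [] };
  for (const target of targets) {
    const key = target.rendered ? "rendered" : "static";
    routes[key].push(target.url);
  }
  return routes;
}

// Hypothetical sample rows; real URLs would come from the carriers table.
const sampleTargets = [
  { url: "https://carrier-a.example/tariffs", rendered: 0 },
  { url: "https://carrier-b.example/portal", rendered: 1 },
  { url: "https://carrier-c.example/rates.php", rendered: 0 },
];
const routed = routeTargets(sampleTargets);
console.log(routed.static.length, routed.rendered.length); // prints "2 1"
```

Keeping the flag in data rather than in code means a portal that moves from static to rendered is a one-row update, not a deploy.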

Rewriting onto Crawlee took four days. The win was operational, not architectural. Failed requests now retried with exponential back-off and rotating sessions instead of crashing the worker. The request queue gave us natural rate limiting per hostname. Limit each hostname to two requests in flight and you stop hammering Hapag-Lloyd into a 429. Average time per static page dropped from 5.2 s to 240 ms because we routed those pages to Cheerio instead of Chromium.
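For readers who have not used a crawler framework, this is roughly what retrying with exponential back-off means. It is a hand-rolled sketch, not Crawlee's implementation; the function names, the 500 ms base, and the 30 s cap are all assumptions chosen for illustration.

```javascript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
function backoffDelayMs(attempt, baseMs = 500, capMs = 30000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Run `task` until it resolves or `maxRetries` retries are used up.
// `sleep` is injectable so the function can be tested without waiting.
async function retryWithBackoff(task, { maxRetries = 3, baseMs = 500, sleep } = {}) {
  const wait = sleep ?? ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  let lastError;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      lastError = err;
      // Only wait if another attempt is coming.
      if (attempt < maxRetries) await wait(backoffDelayMs(attempt, baseMs));
    }
  }
  throw lastError;
}
```

With these defaults the waits are 500 ms, 1 s, and 2 s. A production version would usually add random jitter so that many workers do not retry in lockstep, which is the kind of detail a framework already handles for you.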

Crawlee is not faster than what you would write yourself. It is faster than what you would debug yourself at 11 pm on a Sunday.

Round three: Cloudflare Workers for the boring sixty percent

For the static carriers, the sixty percent of page volume that had none of the headaches, we asked whether we needed a Node runtime at all. So we tried a hand-rolled scraper on Cloudflare Workers with HTMLRewriter, a D1 database for the queue, and Cron Triggers firing every fifteen minutes.

The code was shorter than the deployment YAML.

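export default {
  // Cron Trigger entry point: one sweep over the static carriers.
  async scheduled(event, env) {
    const targets = await env.DB
      .prepare("SELECT id, url FROM carriers WHERE rendered = 0")
      .all();

    for (const row of targets.results) {
      try {
        const res = await fetch(row.url, {
          cf: { cacheTtl: 0 }, // do not serve this from Cloudflare's cache
          headers: { "user-agent": "abn-tariff-monitor/1.4 (+info@abn.company)" }
        });
        // Never store an error page as a tariff snapshot.
        if (!res.ok) throw new Error(`HTTP ${res.status}`);
        const html = await res.text();
        // HTMLRewriter is stream-based, so the extractor is awaited.
        const rates = await extractRates(html); // ~40 lines of HTMLRewriter
        await env.DB
          .prepare("INSERT INTO tariff_snapshots(carrier_id, payload, at) VALUES (?,?,?)")
          .bind(row.id, JSON.stringify(rates), Date.now())
          .run();
      } catch (err) {
        // One failing carrier must not abort the rest of the sweep.
        console.error(`carrier ${row.id}: ${err.message}`);
      }
    }
  }
};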

Twenty carriers, 840 pages a day, and each sweep finishes in under twelve seconds. The Workers free tier covered it. The subsystem costs us nothing per month.

But a plain Worker will not run a real browser. There is no page to render into, the CPU-time limit is tight (10 ms per invocation on the free plan, longer on paid), and most Node APIs are absent, so any carrier that loads tariffs through XHR after DOMContentLoaded is invisible to you. We tried to fake it on two of those portals and gave up after a day.

Warning

Cloudflare D1 meters writes and limits how many queries a single Worker invocation may run. Fan out 800 INSERTs in one cron invocation and you will hit a limit. Collect the prepared statements and send them in one batch() call, or queue writes through a Queues binding.
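The batching advice above could look like the sketch below. It assumes the `tariff_snapshots` table from the Worker code and a D1 binding passed in as `db`; the function name, the shape of `snapshots`, and the chunk size of 50 are illustrative assumptions, not documented D1 thresholds.

```javascript
// Build one bound statement per snapshot and send them in chunks through
// db.batch(), so a sweep issues a handful of calls instead of one per row.
async function insertSnapshots(db, snapshots, chunkSize = 50) {
  const insert = db.prepare(
    "INSERT INTO tariff_snapshots(carrier_id, payload, at) VALUES (?,?,?)"
  );
  const now = Date.now();
  let calls = 0;
  for (let i = 0; i < snapshots.length; i += chunkSize) {
    const statements = snapshots
      .slice(i, i + chunkSize)
      .map((s) => insert.bind(s.carrierId, JSON.stringify(s.rates), now));
    await db.batch(statements); // one round trip per chunk
    calls++;
  }
  return calls; // number of batch() calls made
}
```

Inside the scheduled handler this would replace the per-row INSERT: collect `{ carrierId, rates }` objects in the loop, then call `insertSnapshots(env.DB, collected)` once at the end.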

What four weeks in production looked like

Numbers from the first four weeks after we shipped the hybrid:

Static pages on Workers: 99.6% success rate, €0 per month, mean latency 410 ms.
Rendered pages on Crawlee plus Playwright, on one Hetzner CPX31 box (€16 per month): 97.1% success rate, mean latency 6.4 s.
Total pages scraped per day: 1,418. Total monthly compute: €16, plus about €4 of residential proxy for the four portals that fingerprinted us.

The Playwright-everywhere stack we started with would have cost €310 per month to do the same job at lower reliability. The point is not that Playwright is bad. It is that paying browser tax on a static <table> is the kind of overkill that quietly drains hours and euros for a year before anyone notices.
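To put the two halves on one scale, the figures above can be combined into a page-weighted success rate and a cost saving. This is back-of-envelope arithmetic, and it assumes the rendered volume is simply the daily total minus the 840 static pages.

```javascript
// Figures quoted in this article.
const staticPages = 840;                          // static pages per day on Workers
const totalPages = 1418;                          // total pages per day
const renderedPages = totalPages - staticPages;   // assumption: the remainder

const staticSuccess = 0.996;
const renderedSuccess = 0.971;

// Page-weighted success rate across both halves of the hybrid.
const blendedSuccess =
  (staticPages * staticSuccess + renderedPages * renderedSuccess) / totalPages;

const hybridCostEur = 16 + 4;     // Hetzner box plus residential proxy
const playwrightCostEur = 310;    // Playwright-everywhere stack
const saving = 1 - hybridCostEur / playwrightCostEur;

console.log((blendedSuccess * 100).toFixed(1) + "%"); // prints "98.6%"
console.log((saving * 100).toFixed(1) + "%");         // prints "93.5%"
```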

The hybrid we shipped

Three pieces. A Cloudflare Worker on a 15-minute cron scrapes the twenty static carriers into D1, then pushes rate deltas to a Postgres replica through a single HTTPS webhook. A Crawlee project on one Hetzner box handles the eighteen rendered portals, using PlaywrightCrawler with a shared session pool and a residential proxy on the four picky ones. A small orchestrator reads from both sources, computes the carrier-vs-carrier delta, and posts a single Slack message at 06:15 Amsterdam time, fifteen minutes before the ops lead opens her laptop.
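The delta step is the least glamorous piece and the one the ops lead actually reads. A minimal sketch of one possible reading of it, comparing a carrier's current snapshot against its previous one; the snapshot shape (a plain object keyed by lane) and the lane keys are hypothetical, since the real payload format is not shown here.

```javascript
// Compare two snapshots of the same carrier. Both are assumed to be plain
// objects mapping a lane key (hypothetical, e.g. "RTM-SHA-40HC") to a rate.
function diffRates(previous, current) {
  const deltas = [];
  const lanes = new Set([...Object.keys(previous), ...Object.keys(current)]);
  for (const lane of lanes) {
    const before = previous[lane] ?? null; // null: lane is new
    const after = current[lane] ?? null;   // null: lane disappeared
    if (before !== after) {
      const change = before !== null && after !== null ? after - before : null;
      deltas.push({ lane, before, after, change });
    }
  }
  return deltas;
}

const yesterday = { "RTM-SHA-40HC": 1850, "RTM-NYC-20GP": 2100 };
const today = { "RTM-SHA-40HC": 1925, "RTM-NYC-20GP": 2100, "RTM-SIN-40HC": 1700 };
console.log(diffRates(yesterday, today));
```

For the sample data this reports a change of 75 on the first lane, skips the unchanged second lane, and flags the third as new. Only the entries in that array need to reach Slack, which keeps the morning digest short.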

Her morning portal-clicking went from 50 minutes to a 3-minute scroll through a Slack channel. Quote turnaround on spot inquiries dropped from 90 minutes to about 8.

One thing to do today

If you are scraping more than a few hundred pages a day, audit the share of them that actually need a browser. Open Chrome DevTools' Network tab on one of your targets, switch JavaScript off in the same tab, and reload. If the data is still in the HTML, that page belongs on fetch and cheerio, not on Playwright. Most teams we work with shave 60 to 80 percent of their scraping infrastructure cost with that one audit.
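The same audit can be scripted once you have more targets than patience. A rough sketch, assuming you can name one or two marker strings per page (a column heading, a known rate) that must be present if the data is in the raw HTML; both function names are hypothetical.

```javascript
// True when every marker string already appears in the raw HTML, i.e. the
// data is there before any JavaScript runs.
function dataInStaticHtml(html, markers) {
  const haystack = html.toLowerCase();
  return markers.every((m) => haystack.includes(String(m).toLowerCase()));
}

// Fetch without a browser and run the check (Node 18+ ships a global fetch).
async function auditUrl(url, markers) {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
  return dataInStaticHtml(await res.text(), markers) ? "fetch + parser" : "browser";
}
```

Treat a "browser" verdict as a prompt to look, not as proof. A login wall or a number formatted differently in the markup will also fail the marker check.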

When we built this tariff monitor for the Rotterdam forwarder, the surprise was how much of the value came from the process automation around the scraper, not the scraper itself. The Slack digest, the Postgres replica, and the alert when a carrier's HTML structure shifts ended up accounting for about seventy percent of the time saved.

Key takeaway

Most scrapers pay browser tax on pages that do not need it. Save Playwright for portals that actually run JavaScript; use fetch and an HTML parser for the rest.

FAQ

When should you reach for Playwright versus a hand-rolled Worker?

If the data is in the HTML after JavaScript is disabled, use fetch and an HTML parser. Reach for Playwright only when the page loads its real data through XHR after DOMContentLoaded.

Does Crawlee work without Apify's paid platform?

Yes. Crawlee is open source and runs anywhere Node runs. The Apify platform is a hosted execution layer; the framework itself has no lock-in and no licence cost.

How do you handle a carrier portal changing its HTML?

Wrap each extractor in a schema check that asserts row count, column presence, and value types. Fail the run loudly and route an alert to Slack so you fix it before anyone re-prices a quote on stale data.
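A schema check of that kind can be very small. The sketch below is one way to do it under assumptions: extractors return an array of plain objects, and the column names in the usage note are hypothetical.

```javascript
// Throw when extracted rows do not match the expected shape.
// `schema` maps column name -> expected typeof result; `minRows` guards
// against a selector that silently matches nothing after a redesign.
function assertRows(rows, schema, minRows = 1) {
  if (!Array.isArray(rows) || rows.length < minRows) {
    const got = Array.isArray(rows) ? rows.length : typeof rows;
    throw new Error(`expected at least ${minRows} rows, got ${got}`);
  }
  rows.forEach((row, index) => {
    for (const [column, type] of Object.entries(schema)) {
      if (!(column in row)) throw new Error(`row ${index}: missing column "${column}"`);
      const value = row[column];
      // NaN has typeof "number", so numbers get a stricter check.
      const ok = type === "number" ? Number.isFinite(value) : typeof value === type;
      if (!ok) throw new Error(`row ${index}: "${column}" is not a ${type}`);
    }
  });
  return rows;
}
```

Called as `assertRows(rates, { lane: "string", rate: "number", currency: "string" })` right after extraction, it turns a quiet layout change into an exception you can catch and forward to Slack.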

Why not just run Playwright on Cloudflare Workers via Browser Rendering?

Browser Rendering works, but for roughly 600 rendered pages a day its per-session cost and cold-start times compared poorly with a €16 Hetzner box that runs Crawlee plus Playwright with cached browser contexts.
