Browserbase vs Stagehand vs Playwright: carrier scraping

A Rotterdam freight forwarder needed 4,800 weekly rate checks across Maersk, MSC, and Hapag-Lloyd. We tested Browserbase, Stagehand, and a hand-rolled Playwright loop. Here's what broke.

Jacob Molkenboer · Founder · A Brand New Company · 20 Jun 2026 · 9 min

It is 23:14 on a Sunday evening. Maersk has quietly shipped a Next.js rewrite of their instant-quote page. Every CSS selector our agent depended on is gone. The 18-person freight forwarder in Rotterdam that hired us to monitor 200 trade lanes is going to wake up to a dashboard full of nulls unless someone gets to a laptop in the next hour.

This post is what we learned scoring three scraping stacks — Browserbase, Stagehand, and a hand-rolled Playwright loop with Claude as the brain — against the only three questions a freight forwarder actually cares about: what does each rate-check cost, what happens when Cloudflare Turnstile gets harder, and who is on call when the carrier rewrites their site on a Sunday night.

The brief, in numbers

The client books containers from Asia and the Med into Rotterdam, Hamburg and Felixstowe. Their offer process used to start with a junior pulling spot rates from the Maersk, MSC and Hapag-Lloyd portals. Three carriers, 200 active routes, roughly eight refreshes per route per week to catch GRI announcements before the customer's competitor does. That arithmetic gives us 4,800 quote-checks a week, and the agent we built has to do them under a Cloudflare Turnstile shield, behind a portal login, with the result landing in a Postgres table that fires a Slack alert when a spot rate moves more than 7% on a watched lane.
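The arithmetic and the alert rule from the brief fit in a few lines. This is a minimal sketch rather than the production code: `exceedsThreshold` is a hypothetical helper name, and treating the 7% move as a relative change against the previous stored rate is an assumption.

```typescript
// Workload figures quoted in the brief.
const carriers = ["Maersk", "MSC", "Hapag-Lloyd"];
const activeRoutes = 200;
const refreshesPerRoutePerWeek = 8;

// 3 carriers x 200 routes x 8 refreshes
const weeklyChecks = carriers.length * activeRoutes * refreshesPerRoutePerWeek;

// Hypothetical helper: true when the rate moved by more than `threshold`
// relative to the previously stored rate (assumed interpretation of "moves more than 7%").
function exceedsThreshold(previousUsd: number, currentUsd: number, threshold = 0.07): boolean {
  if (previousUsd <= 0) return false; // no usable baseline, so no alert
  return Math.abs(currentUsd - previousUsd) / previousUsd > threshold;
}

console.log(weeklyChecks); // 4800
console.log(exceedsThreshold(2000, 2150)); // 7.5% move: true
console.log(exceedsThreshold(2000, 2100)); // 5% move: false
```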

We had three real options for the scraping layer.

Browserbase as the substrate

Browserbase is a managed headless-browser service. You connect to it over CDP, get a Chromium instance with stealth defaults, residential proxies, and a session recorder you can hand to a customer-success engineer at 09:00 on Monday morning. You still write your own Playwright code. Browserbase solves the "where the browser runs" problem. It does not solve what the browser does on the page.

For this workload, the appealing parts were the proxy pool and the built-in Turnstile handling. The unappealing part was browser-minute billing. 4,800 checks at roughly 35 seconds each is about 47 browser-hours a week before retries. That bill is fine if there is no other way to reach the data without getting blocked; it stops being fine once you remember that a €40/month VPS can run Chromium around the clock.
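The browser-hours figure is plain arithmetic, shown here as a small sketch. The `retryRate` parameter is a hypothetical knob for illustration; the post does not state an actual retry rate.

```typescript
const checksPerWeek = 4800;
const secondsPerCheck = 35; // rough average quoted above

// 4,800 x 35 s = 168,000 s, divided by 3,600 s per hour
const browserHoursPerWeek = (checksPerWeek * secondsPerCheck) / 3600;
console.log(browserHoursPerWeek.toFixed(1)); // "46.7"

// Hypothetical: if a fraction `retryRate` of checks has to run a second time,
// the billed hours grow by the same fraction.
function hoursWithRetries(baseHours: number, retryRate: number): number {
  return baseHours * (1 + retryRate);
}
```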

Stagehand as the intent layer

Stagehand is the Browserbase team's open-source framework that wraps Playwright in three LLM-driven primitives: act(), extract(), and observe(). Instead of telling the browser to click [data-testid="rate-row-0"], you tell it to extract the spot rate for a 40HC from Yantian to Rotterdam, departure within the next 14 days. A model resolves the intent against the current DOM at run time.

That is exactly the right pitch for this workload. Selectors break every quarter; carrier marketing teams ship redesigns without telling their own engineers, never mind their scrapers. An intent-level script survives a Next.js rewrite as long as the words on the page still mean what they meant before.

import { Stagehand } from "@browserbasehq/stagehand";
import { z } from "zod";

// Run the browser on Browserbase rather than locally.
const stagehand = new Stagehand({ env: "BROWSERBASE" });
await stagehand.init();

await stagehand.page.goto("https://www.maersk.com/instant-quote");

// Each act() call is resolved against the live DOM by the model at run time.
await stagehand.page.act("dismiss the cookie banner");
await stagehand.page.act(
  "search for a 40HC from Yantian to Rotterdam, sailing within 14 days"
);

// extract() returns data validated against the zod schema.
const { rate } = await stagehand.page.extract({
  instruction:
    "extract the cheapest spot rate in USD for a 40HC, including the carrier surcharges line",
  schema: z.object({
    rate: z.object({
      amount: z.number(),
      currency: z.string(),
      transitDays: z.number(),
      sailingDate: z.string(),
    }),
  }),
});

// Release the remote session so it stops accruing browser minutes.
await stagehand.close();

The running cost is easy to state: each check costs you Browserbase minutes plus a few thousand LLM tokens for the act/extract calls, and across 4,800 weekly checks that adds up. The maintenance cost runs the other way: almost nothing on a normal week, and still very little on the night Maersk ships their rewrite, because the model re-resolves "the cheapest 40HC rate" against the new DOM.

The hand-rolled Playwright loop with Claude tool use

The third option was the one we had built for two earlier clients: a Playwright instance running on our own infrastructure, controlled by a Claude tool-use loop. The agent gets four tools — screenshot, click_at(x, y), type(text), extract_region(box, schema) — and a system prompt that explains the job. The model reasons over the screenshot, picks the next action, and the loop terminates when the structured rate object comes back.

This is the same shape Anthropic ships in computer use, just narrower. You give up Stagehand's act/extract ergonomics and write the loop yourself; you gain the ability to mix vision-driven steps with cheap, deterministic Playwright steps inside the same session. For the 30% of pages on each carrier where the layout never changes (the login form, the route picker), you call page.fill() directly and burn no tokens. For the 70% that changes quarterly, the model takes over.
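The control flow of such a loop can be sketched without any model or browser attached. Everything below is illustrative: the `Browser` and `Decide` interfaces are hypothetical stand-ins for a Playwright wrapper and a Claude tool-use call, the `Rate` fields are borrowed from the Stagehand schema earlier in the post, and the schema argument of extract_region is left out for brevity.

```typescript
// Structured result the loop is trying to produce.
interface Rate { amount: number; currency: string; transitDays: number; sailingDate: string }

type Box = { x: number; y: number; width: number; height: number };

// The four tools from the post, modelled as a discriminated union.
type Action =
  | { tool: "screenshot" }
  | { tool: "click_at"; x: number; y: number }
  | { tool: "type"; text: string }
  | { tool: "extract_region"; box: Box };

// Hypothetical wrapper around a Playwright page.
interface Browser {
  screenshot(): Promise<string>;
  clickAt(x: number, y: number): Promise<void>;
  type(text: string): Promise<void>;
  extractRegion(box: Box): Promise<Rate | null>;
}

// Hypothetical wrapper around the model call: screenshot and history in, next action out.
type Decide = (screenshot: string, history: Action[]) => Promise<Action>;

async function runLoop(browser: Browser, decide: Decide, maxSteps = 12): Promise<Rate | null> {
  const history: Action[] = [];
  let shot = await browser.screenshot();
  for (let step = 0; step < maxSteps; step++) {
    const action = await decide(shot, history);
    history.push(action);
    switch (action.tool) {
      case "screenshot":
        break; // nothing to do here, a fresh screenshot is taken below
      case "click_at":
        await browser.clickAt(action.x, action.y);
        break;
      case "type":
        await browser.type(action.text);
        break;
      case "extract_region": {
        const rate = await browser.extractRegion(action.box);
        if (rate) return rate; // the loop ends once a structured rate comes back
        break;
      }
    }
    shot = await browser.screenshot(); // every further step ships a new screenshot
  }
  return null; // step budget exhausted without a rate
}
```

The step budget is the design choice worth noting: without it, a confused model can keep clicking and burning vision tokens indefinitely. Deterministic steps such as the login form would run before `runLoop` is called, outside the model's control.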

Takeaway

The real per-route cost is not LLM tokens or browser minutes. It is the median minutes an engineer spends rescuing the scraper after a carrier ships a layout change at 23:00 on a Sunday.

Per-check cost with the maths shown

We benchmarked all three across one test week. The unit was a single Maersk, MSC or Hapag-Lloyd quote-check from cold session start to structured row written. Order of magnitude only — your routes, retries and proxy mix will move these:

Hand-rolled Playwright on a €40 VPS, no LLM in the inner loop. Effectively free per check. Engineer hours: very few, until the layout changes, then dozens in one weekend.
Stagehand on Browserbase. Browser minutes plus roughly 3-6k LLM tokens per check for act and extract. At 4,800 checks a week the LLM bill alone is non-trivial but predictable. Engineer hours per redesign: roughly zero.
Claude tool-use loop on our own Chromium. Vision tokens dominate. We measured an average of 12-18k input tokens per check on volatile pages, because every loop step ships a screenshot. Cheap to run on stable pages, expensive on the volatile ones, and no browser-service bill.

Once you map the unit cost onto 4,800 weekly checks, the ranking is unambiguous: hand-rolled is cheapest when it works, Stagehand is the cheapest across the full year once you include redesigns, and the Claude tool-use loop is the most expensive per check but the most flexible if your job is half scraping and half filling forms.
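To make the mapping concrete, here is a rough sketch of the weekly token volume implied by the per-check ranges above. It assumes every one of the 4,800 checks sits inside the quoted range, which overstates the Claude loop because its 12-18k figure applies to volatile pages only. `usdPerMillionTokens` is a placeholder to be filled in from your provider's price list; no real price is implied.

```typescript
type TokenRange = { low: number; high: number };

const CHECKS_PER_WEEK = 4800;

// Scale a per-check token range to a weekly total.
function weeklyTokens(perCheck: TokenRange, checks: number = CHECKS_PER_WEEK): TokenRange {
  return { low: perCheck.low * checks, high: perCheck.high * checks };
}

// Stagehand: roughly 3-6k tokens per check.
const stagehandWeekly = weeklyTokens({ low: 3000, high: 6000 });
// Claude tool-use loop: 12-18k input tokens per check, as an upper bound.
const visionLoopWeekly = weeklyTokens({ low: 12000, high: 18000 });

// Placeholder pricing: pass in whatever your provider currently charges.
function weeklyCostUsd(tokens: TokenRange, usdPerMillionTokens: number): TokenRange {
  return {
    low: (tokens.low / 1_000_000) * usdPerMillionTokens,
    high: (tokens.high / 1_000_000) * usdPerMillionTokens,
  };
}

console.log(stagehandWeekly);  // { low: 14400000, high: 28800000 }
console.log(visionLoopWeekly); // { low: 57600000, high: 86400000 }
```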

Captcha defensibility under Turnstile

All three carriers gate their instant-quote portals behind Cloudflare Turnstile. Turnstile is harder than reCAPTCHA v3 to brute-force because it correlates browser fingerprint, TLS handshake and behavioural signals before it even shows a challenge. The right answer is rarely "solve the challenge" and almost always "don't trigger it."

Browserbase's stealth profile cleared Maersk and Hapag-Lloyd on roughly 94 of every 100 cold sessions during our test week. MSC was harder; their portal sits on a stricter zone and the cold-session pass rate sat around 71 in 100. Stagehand inherits whatever browser it runs on, so the answer is the same when you point it at Browserbase. The hand-rolled stack, running headed Chromium on our VPS with residential proxies via a third party, sat around 82 in 100 across all three carriers — worse than Browserbase, but tunable.

When the challenge does fire you either retry from a fresh session with a different exit IP, or you outsource to a solver. Every minute you spend tuning that pipeline is a minute you are not spending on your client's actual problem. The honest verdict: Browserbase's anti-bot work is worth paying for on the routes where the carrier is hostile.
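A retry-from-fresh-session policy might look like the sketch below. `RunSession` is a hypothetical callback that starts a new browser session on the given exit IP and reports whether Turnstile challenged it. The `passWithin` estimate assumes attempts are independent, which real fingerprinting may well violate.

```typescript
type SessionResult<T> =
  | { ok: true; value: T }
  | { ok: false; reason: "challenged" };

// Hypothetical: starts a brand-new session on `exitIp` and runs one check in it.
type RunSession<T> = (exitIp: string) => Promise<SessionResult<T>>;

// Try up to `maxAttempts` fresh sessions, each on a different exit IP.
async function retryWithFreshSession<T>(
  run: RunSession<T>,
  exitIps: string[],
  maxAttempts = 3
): Promise<T | null> {
  for (const exitIp of exitIps.slice(0, maxAttempts)) {
    const result = await run(exitIp);
    if (result.ok) return result.value;
  }
  return null; // every attempt was challenged; hand off to a solver or a human
}

// Chance that at least one of `attempts` cold sessions gets through,
// assuming independent attempts with the same cold pass rate.
function passWithin(coldPassRate: number, attempts: number): number {
  return 1 - Math.pow(1 - coldPassRate, attempts);
}

console.log(passWithin(0.71, 3).toFixed(3)); // MSC-like pass rate, three attempts: "0.976"
```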

The selector-patches-itself-on-Sunday question

Tariff agents are a maintenance product, not a build product. The scraping layer you ship is going to outlive at least two carrier redesigns, and the median time between a Maersk front-end rewrite and the next one is, in our experience, about seven months. So the only question that matters in the long run is: when the rewrite ships, what does Monday morning look like?

For the hand-rolled Playwright agent with hardcoded selectors, Monday morning looks like a senior engineer with the carrier's site open in DevTools, rewriting CSS paths and praying nothing else moved. We watched this happen four times in 2025 alone for an earlier client. The mean repair time was 90 minutes; the worst was a three-hour Sunday-evening panic when Maersk's new tracking page started serving rates from a different React island.

For Stagehand, the Monday-morning experience is checking the dashboard, seeing nulls for the first 40 minutes of Sunday traffic while the LLM resolves intents against the new DOM under retry, and then the rates coming back. Two of the four 2025 rewrites we lived through required no code change at all. The other two needed a one-line update to the natural-language instruction.

For the Claude tool-use loop the answer is similar to Stagehand, with one caveat: vision-driven loops sometimes succeed on the wrong cell, especially on table layouts where two rates look almost identical. We solved that with a schema-level sanity check — any rate below $400 per 40HC ex-Asia gets flagged for human review — but it cost us trust in the autonomous mode for two weeks. If you go with an LLM-driven scraper, ship that sanity check from day one. The failure mode is not a crash; it is a confident, wrong number going into your pricing model.
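A floor check of this kind is only a few lines. The sketch below uses the $400 floor from the paragraph above; the names are hypothetical and the currency check is an added assumption, not something the post describes.

```typescript
interface ExtractedRate { amount: number; currency: string }

type Verdict =
  | { status: "accepted" }
  | { status: "needs_review"; reason: string };

// Floor quoted in the post for a 40HC ex-Asia.
const FLOOR_USD_40HC_EX_ASIA = 400;

function checkRate(rate: ExtractedRate, floorUsd: number = FLOOR_USD_40HC_EX_ASIA): Verdict {
  // Assumption: rates are expected in USD, anything else is suspicious.
  if (rate.currency !== "USD") {
    return { status: "needs_review", reason: `unexpected currency ${rate.currency}` };
  }
  // NaN, Infinity and anything under the floor go to a human instead of the pricing model.
  if (!Number.isFinite(rate.amount) || rate.amount < floorUsd) {
    return { status: "needs_review", reason: `amount ${rate.amount} is below the ${floorUsd} USD floor` };
  }
  return { status: "accepted" };
}

console.log(checkRate({ amount: 2150, currency: "USD" }).status); // "accepted"
console.log(checkRate({ amount: 215, currency: "USD" }).status);  // "needs_review"
```

The check returns a verdict rather than throwing, because a suspicious rate is a review item and not a crash.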

What we ended up shipping

The final architecture is boring, which is the right shape for a maintenance product. Browserbase runs the browsers. Stagehand handles the two carriers (Maersk and Hapag-Lloyd) whose portals change quarterly. A plain Playwright script — no LLM in the loop — handles the third carrier, MSC, because we found a JSON endpoint behind their portal that returns the rate object directly, and there is no point burning tokens on a page we never actually render. A nightly cron diffs the structured rates into Postgres; Slack gets pinged on the 7% threshold.
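The routing decision itself reduces to a small lookup table. The strategy labels below are invented for illustration; only the carrier-to-approach mapping comes from the paragraph above.

```typescript
type Carrier = "Maersk" | "MSC" | "Hapag-Lloyd";

// Hypothetical labels for the two code paths described above.
type Strategy = "stagehand-on-browserbase" | "playwright-json-endpoint";

const strategyByCarrier: Record<Carrier, Strategy> = {
  Maersk: "stagehand-on-browserbase",
  "Hapag-Lloyd": "stagehand-on-browserbase",
  MSC: "playwright-json-endpoint", // JSON endpoint behind the portal, no LLM in the loop
};

// Only the Stagehand path spends LLM tokens.
function usesLlm(carrier: Carrier): boolean {
  return strategyByCarrier[carrier] === "stagehand-on-browserbase";
}

console.log(usesLlm("Maersk")); // true
console.log(usesLlm("MSC"));    // false
```

Typing the table as `Record<Carrier, Strategy>` means adding a fourth carrier to the union will not compile until someone decides which path it takes.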

The 4,800 weekly checks now cost roughly the same as one junior FTE's afternoon used to, with no humans in the loop on a normal week and one Slack ping per quarter on the abnormal weeks. The forwarder has used the headroom to bid on routes they did not have capacity to monitor before, which is the only ROI calculation that matters in this business.

When we built the tariff-monitoring agent for the Rotterdam freight forwarder, the thing we kept coming back to was that the cheapest stack is the one that does not wake anyone up. We ended up solving it by paying a little more per check for Stagehand on the volatile carriers and treating it as a hybrid problem rather than a single-tool answer. The same pattern shows up under most of the AI agents we ship: a managed substrate at the bottom, an intent layer where the DOM is volatile, deterministic code where the surface is stable. If you want to run the same five-minute audit on your own scraper, open your incident log and count the Sunday-night entries from the last twelve months. That number is your real per-route cost.

Key takeaway

Treat the scraping layer as a maintenance product. The cheapest stack is the one that does not wake an engineer at 23:00 on a Sunday.

FAQ

Which stack did you actually ship for the Rotterdam freight forwarder?

Stagehand on Browserbase for Maersk and Hapag-Lloyd, plain Playwright against a reverse-engineered JSON endpoint for MSC, with a Postgres diff and a Slack alert on a 7% spot-rate move.

Is Stagehand worth the extra tokens per check?

On portals that get redesigned more than once a year, yes. The token bill is a predictable cost; engineer time on Sunday night is not, and that is the line item that wrecks budgets.

Does Browserbase clear Cloudflare Turnstile reliably?

In our test week it cleared Maersk and Hapag-Lloyd from a cold session about 94 times in 100. MSC was tighter, around 71 in 100. You still want a retry-with-fresh-session policy.

What was the worst failure mode you hit?

An LLM-driven loop returning a confident wrong number from the neighbouring table cell. We caught it with a schema-level floor on the cheapest plausible rate per route, not with retries.
