← Blog

E-commerce

Returns-agent unit economics: a field-guide to 13 misreads

A Returnly extensie quoted at €58k a year. A Sonnet plus PostNL cache build at a third. The thirteen places your per-order margin spreadsheet is quietly lying to you.

Jacob Molkenboer· Founder · A Brand New Company· 2 Jun 2026· 9 min
Brass two-pan scale on ivory paper, three twine-tied parcels on left pan, green postcard on right, ledger with wax seal.

It is Sunday evening in Utrecht. A founder running a €4M-ARR accessoires brand is staring at a PDF on her second monitor: a quote from a Returnly partner for their "AI-driven returns-agent" extensie — fraud, eligibility, the lot — €58.000 a year, three-month build, locked into the platform's webhook envelope. Q1 just closed at 9.200 retouren. The board call is Tuesday. She opens her unit-economics spreadsheet, the one with the per-bestelling marge formula stitched together from a 2022 carrier audit, and the formula says yes, this works, just barely.

It does not work. The formula is wrong in thirteen specific places, and we have seen the same wrong formula in eleven Dutch sub-€10M shops over the last eighteen months. The shape is so consistent that we now run a forty-minute audit against it before any returns-agent quote. This post is that audit.

What the right shape looks like

Before the misreads, the alternative. A returns-agent that handles 9.200 retouren a quarter does not need a SaaS extensie. It needs four moving parts:

  • A PostNL track-and-trace cache that resolves package status before any LLM is invoked. Most refund decisions are deterministic the moment the parcel hits the sorteercentrum.
  • A fraud-rubric gate: a hand-written checklist of seven to twelve conditions (refund frequency, IP risk, item value, prior chargebacks) scored without an LLM at all.
  • A Claude Sonnet decision step, invoked only on the cases the gate flags as ambiguous, with the rubric and the parcel state passed in as context.
  • A human-in-the-loop queue for the residue, sized at roughly 3-6% of total returns.

The marginal cost per retour, fully loaded, lands between €0,18 and €0,34 depending on how often Sonnet gets called. Returnly's extensie was quoted at roughly €1,80 per retour. The factor is real. The spreadsheet hid it because the formula was assembling cost the wrong way around.

The thirteen misreads, ranked

We rank them by how much it costs to undo. Seven you fix tonight by editing one prompt template or one config file. Six force you to rewrite the retour-beleid and possibly re-paper the algemene voorwaarden. The ranking matters because it tells you where the operations-manager can move alone and where she needs to put it on the next directie agenda.

Reversible in one prompt template

  1. Pricing the LLM at list, not at cached input. The rubric, the policy, the SKU catalogue: none of it changes between retouren. Cache it. Anthropic's prompt-caching discount on cached input is the single largest line-item swing in the whole model, and people forget it because the pricing page lists the headline number first.
  2. Treating "fraud check" as one LLM call. It is two. A cheap classifier (Haiku, or no LLM at all) shortlists. The expensive judge (Sonnet) only sees the shortlist. Skipping the cascade roughly triples the per-decision cost.
  3. Asking the LLM to read the track-and-trace string. PostNL's API returns structured fields. Parse them. The LLM is for cases where the parcel state and the buyer's claim disagree, not for OCR on a status code.
  4. No caching of carrier responses. A returns-agent that re-fetches the same barcode three times during one decision pays three times and waits three times. Cache by barcode with a short TTL. Sub-€10 of infra a month.
  5. Counting agent turns as if every turn is a full system-prompt reload. If you are not using prompt caching across the agent loop, you are paying input tokens for the same rubric on every tool call. Anthropic's caching guide covers the pattern.
  6. Forgetting that fraud-flagged refunds you do not issue are revenue. Most spreadsheets count avoided fraud as cost-saved, not as preserved omzet, which means the agent's ROI looks half as good as it is. Move it to the revenue side and the per-bestelling marge model flips.
  7. Counting human review time at zero. Even at 3-6% queue rate, that is roughly 270-550 retouren a quarter that a human touches. At seven minutes apiece you owe a real cost line. Returnly's quote hid this; your build should not.

Each of these is a config or template change. None of them requires anyone to talk to the legal team. If your operations-manager has the agent code, they can ship the lot in an afternoon and the per-retour cost falls by something close to 55%.

Forces a retour-beleid herziening

  1. Confusing "accept the return" with "refund without inspection". These are two decisions. The agent should always accept the return (legally you mostly have to, under EU consumer law, with narrow exceptions). The fraud question is whether you refund before the parcel is back and verified. Conflating them produces a spreadsheet that under-counts both legal exposure and warehouse cost.
  2. Treating the Returnly per-order fee as marginal. Most of it is fixed: the platform, the integrations, the dashboards. The marginal cost of one more retour through Returnly is much lower than the fee suggests, which means the build-vs-buy crossover is at higher volume than the spreadsheet implies. The flip side: at 9.200 retouren a quarter, you are already past it.
  3. One per-bestelling marge for all retour-redenen. Sizing returns from a regular customer cost almost nothing in expected fraud. Damaged-on-arrival cost almost nothing in fraud and almost everything in supplier-claim work. "Veranderd van gedachten" on a €180 item is where the fraud rubric earns its keep. Average them together and you build the wrong agent.
  4. Computing margin on bruto orderwaarde, not contribution margin after COGS, carrier, and payment fees. A €60 order with a €22 contribution margin is not a €60 risk. The fraud gate should reference contribution margin, not gross. Otherwise your threshold for auto-refund is set in the wrong currency.
  5. Carrier-cost arbitrage missing entirely. PostNL retour-postzegel and DHL pickup price differently per parcel weight band. Below €10 SKU value, the cheapest reverse logistics often beats the value of the item, and your agent should know that and offer the customer a refund without return for the bottom tail. Most retour-beleid documents we read do not allow this. They should.
  6. Not modelling the identiteits-verificatie cost on the agent provider. Anthropic's announcement that ID verification will be required for certain capabilities from 8 July changes the operating overhead for anyone running an agent with elevated tool permissions. Small money, real money, fixed-cost line. Which means it favours owning the build over renting a per-order extensie.
Warning

If your spreadsheet's per-bestelling marge formula uses bruto orderwaarde anywhere, you cannot reason about a returns-agent honestly. Fix that cell first. Everything else is downstream.

The math, plainly

Run the numbers yourself with cached-input pricing in mind. The rough shape, sized for the 9.200-retour quarter:

RETOURNS_PER_QUARTER = 9_200

# Gate filters ~70% deterministically (track-and-trace closed loop, trusted customer)
GATE_RESOLVES = 0.70
SONNET_CASES  = RETOURNS_PER_QUARTER * (1 - GATE_RESOLVES)  # 2,760

# Per-decision LLM cost, with prompt caching on the rubric+policy
RUBRIC_TOKENS_CACHED   = 4_200
DECISION_INPUT_TOKENS  = 600
DECISION_OUTPUT_TOKENS = 220

# Sonnet reference prices in EUR per Mtok (check the pricing page; FX moves)
PRICE_CACHED_INPUT = 0.30
PRICE_INPUT        = 3.00
PRICE_OUTPUT       = 15.00

cost_per_decision = (
    RUBRIC_TOKENS_CACHED   / 1_000_000 * PRICE_CACHED_INPUT +
    DECISION_INPUT_TOKENS  / 1_000_000 * PRICE_INPUT +
    DECISION_OUTPUT_TOKENS / 1_000_000 * PRICE_OUTPUT
)

llm_cost_quarter   = SONNET_CASES * cost_per_decision           # ~€10
human_cost_quarter = RETOURNS_PER_QUARTER * 0.04 * (7/60) * 22  # ~€944
infra_cost_quarter = 180     # cache + queue + observability

total = llm_cost_quarter + human_cost_quarter + infra_cost_quarter
print(round(total, 2), "EUR per quarter, all-in")

That total lands around €1.130 a quarter against a Returnly extensie quoted at roughly €14.500 a quarter. The asymmetry is not because Returnly is overpriced. It is because the SaaS price has to amortise the platform's whole product surface across customers who do not need most of it. You need fraud-gating and Sonnet judgement on 30% of cases. You are paying for forty other things.

Where the HN front page intersects this

Two recent threads matter for the build. Anthropic's notice that ID verification will be required for certain capabilities from 8 July means the agent's API key holder needs to be a real, verified entity. Fine for the founder, friction if you were planning to give the agent its own credentials. The companion piece, on temporary Cloudflare accounts for AI agents, sketches a pattern for short-lived, scoped credentials for downstream tools the agent calls. Together they imply that production returns-agents are about to look more like service accounts with audit trails than like prompts in a Lambda.

The third HN piece, on building reliable agentic systems, lands on the same point we keep making to clients: the LLM is the smallest part of the agent. The boring parts — the carrier cache, the rubric, the queue, the audit log — are where reliability actually lives.

What to do tomorrow morning

Open the per-bestelling marge spreadsheet. Find the cell that uses bruto orderwaarde. Replace it with contribution margin. Then put one tab next to it that splits retour-redenen into four buckets: sizing, damaged, changed-mind, suspected-fraud. The four numbers will tell you, in about ten minutes, whether your returns-agent project is sized to one prompt template or to a full retour-beleid herziening. Most of the eleven shops we audited learned they only needed the prompt template.

When we built the returns-agent for a Dutch fashion-accessoires brand at roughly this scale, the thing we ran into was that the PostNL track-and-trace cache had to handle weekend sorteercentrum delays without trapping refunds in limbo for 72 hours; we ended up solving it by routing those parcels straight to the human queue with a "carrier delay" tag. That is the kind of thing the AI agents work always comes down to: the LLM is fine, the seams are the work.

Key takeaway

The LLM is the smallest line on a returns-agent's P&L. The carrier cache, the rubric, and the human queue are where the unit economics actually live.

FAQ

Can a returns-agent really make refund decisions without human review?

Yes for the deterministic majority, no for the residue. A well-built agent auto-resolves 60-75% of returns via the carrier cache and rubric, then routes the ambiguous tail to a human queue with all the context attached.

What does Anthropic's ID verification requirement change for my agent?

From 8 July, the API key holder needs to be a verified entity for certain capabilities. For a single shop this is paperwork. It also makes self-hosted, audited agent infrastructure look more attractive than per-order SaaS.

Why gate on PostNL track-and-trace before invoking Sonnet?

Because most refund decisions are deterministic once the parcel state is known. Spending €0,02 on a carrier API call to avoid a €0,15 LLM call is the cheapest optimisation in the whole stack.

Is Returnly always the wrong choice?

No. Below roughly 2.000 retouren per quarter the build-vs-buy math usually favours the SaaS. The crossover point depends on contribution margin and fraud-rate, not order count alone.

ai agentsautomatione-commerceprocess automationoperationsintegrations

Building something?

Start a project