
Email automation playbook: 4,120 weekly grocery complaints

A 28-person grocer in Haarlem gets 4,120 complaint emails a week on a Magento 1.9 store from 2013. Here is how we cut the human-handled inbox by roughly 70%.

Jacob Molkenboer · Founder · A Brand New Company · 16 Jun 2026 · 9 min


Monday, 06:14, a dispatch desk above the Haarlem warehouse. The night shift left 837 customer emails in the klantenservice@ mailbox. Two service agents arrive at seven, and by lunch another 1,200 mails land on top. Yesterday's delivery van missed twelve drops in Heemstede because a side street was closed for a bridge inspection. The WhatsApp from the dispatcher to the team lead says one word: help.

This is a 28-person online supermarket. Twelve pickers, six drivers, four office staff, two service agents, founders, a stock controller, a bookkeeper. They do roughly €18M in revenue against a webshop that has been online since 2013. The webshop is Magento 1.9. The warehouse runs a custom Symfony 2 system that someone's brother-in-law wrote during a long winter and nobody has dared touch since.

They asked us to make the inbox smaller. Not to replace the stack. Not to migrate to Magento 2. Not to throw out the WMS. Just to keep the two service agents from drowning every Monday.

This is the playbook we ran.

What 4,120 weekly complaints actually look like

Before any agent gets near production, you have to know the shape of the inbox. We pulled twelve weeks of mails out of the IMAP archive and clustered them by hand. The breakdown was harder to ignore than any pie chart we could have drawn.

Missed window or late delivery: 38%. Driver was late, customer was not home, package went back to depot.
Wrong or missing item: 27%. Picker scanned the wrong SKU, or the substitution rule fired and the customer disagreed.
Quality complaint, non-allergic: 14%. Bruised avocado, sour milk one day before expiry, broken egg.
Refund chase: 9%. "I emailed last Tuesday about the missing yoghurt."
Allergy complaint (allergie-klacht): 4%. Almonds in a "nut-free" tray, soy in a sauce labelled soy-free, gluten in a "glutenvrij" (gluten-free) loaf.
Everything else: 8%. Address changes, password resets, GDPR exports, and the occasional marriage proposal to the bakery team.

That 4% allergy bucket is small in volume and enormous in risk. One badly handled mail there is not a refund problem, it is a liability problem and a regulator problem. We will come back to it.

The shape of the stack we could not change

Magento 1.9 reached end of life in June 2020. The merchant knew. They had two failed migration attempts behind them, both stalled on the custom checkout that calculates statiegeld (the bottle-and-crate deposit) per crate. The Symfony WMS speaks to Magento through a nightly cron and a REST endpoint that returns the wrong HTTP code on half the error paths. Order data lives in sales_flat_order. Replacement orders live in a separate table the WMS owns. Refund credit memos live in Magento. Three databases, two truths.

The brief was simple. We do not write to Magento. We do not write to the WMS. We only read from both. The agent lives in a separate process, reads the inbox over IMAP, writes outbound mail through a relay, and parks any human action it cannot do in a queue.

Warning

If a legacy system returns HTTP 200 with an error body, your agent will happily believe the refund went through. Read the body, not the status line. We learned this on day three.
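A minimal sketch of "read the body, not the status line", assuming the legacy endpoint answers with JSON. The `error` and `success` field names are hypothetical; the real WMS contract will differ, so treat this as the shape of the check rather than the check itself.

```php
<?php
declare(strict_types=1);

/**
 * Decide whether a legacy call really succeeded.
 * Assumption: the body is JSON and failures carry an "error" key
 * or "success": false. Both field names are hypothetical.
 */
function legacyCallSucceeded(int $status, string $body): bool
{
    // A non-2xx status is a failure regardless of the body.
    if ($status < 200 || $status >= 300) {
        return false;
    }

    try {
        $data = json_decode($body, true, 512, JSON_THROW_ON_ERROR);
    } catch (JsonException) {
        // A 2xx with an unparseable body counts as a failure, not a success.
        return false;
    }

    if (!is_array($data)) {
        return false;
    }

    // HTTP 200 with an error payload: the case the warning is about.
    if (array_key_exists('error', $data) && $data['error'] !== null) {
        return false;
    }

    return ($data['success'] ?? true) !== false;
}
```

The design choice is to fail closed: anything the agent cannot positively read as a success is handed to a human instead of being reported to the customer as done.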

Classification before generation

Every mail enters one funnel. Before any language model gets near the body, a classifier decides what kind of mail it is and what data to fetch. The classifier is a small prompt with a strict JSON schema, run against a cheap model. The more capable model only sees mails that survive the first pass.

// app/Console/Command/InboxIngestCommand.php
public function handle(MailEnvelope $envelope): void
{
    // Resolve the order first so the classifier sees order context (may be null).
    $order = $this->orderResolver->fromMail($envelope);

    // First pass on the cheap model: decide the intent before any reply is drafted.
    $classification = $this->classifier->classify([
        'subject' => $envelope->subject(),
        'body'    => $envelope->plainBody(),
        'order'   => $order?->summary(),
    ]);

    // One flow per intent; anything unrecognised is parked for a human.
    match ($classification->intent) {
        Intent::MissedWindow      => $this->deliveryFlow->handle($envelope, $order, $classification),
        Intent::WrongOrMissing    => $this->wrongItemFlow->handle($envelope, $order, $classification),
        Intent::QualityNonAllergy => $this->qualityFlow->handle($envelope, $order, $classification),
        Intent::RefundChase       => $this->refundChaseFlow->handle($envelope, $order, $classification),
        Intent::Allergy           => $this->allergyFlow->handle($envelope, $order, $classification),
        Intent::Other             => $this->humanQueue->park($envelope, 'unclassified'),
    };
}
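The strict JSON schema on the classifier output could be enforced along these lines. This is a sketch under assumptions: the `Intent` enum here is a stand-in with made-up string values, the `confidence` field is hypothetical, and the real classifier may validate differently. The point is that anything off-schema degrades to `Other`, which the ingest command parks for a human.

```php
<?php
declare(strict_types=1);

// Stand-in for the article's Intent type. The string values are assumptions.
enum Intent: string
{
    case MissedWindow = 'missed_window';
    case WrongOrMissing = 'wrong_or_missing';
    case QualityNonAllergy = 'quality_non_allergy';
    case RefundChase = 'refund_chase';
    case Allergy = 'allergy';
    case Other = 'other';
}

/**
 * Parse raw model output into a validated classification.
 *
 * @return array{intent: Intent, confidence: float}
 */
function parseClassification(string $rawJson): array
{
    $fallback = ['intent' => Intent::Other, 'confidence' => 0.0];

    try {
        $data = json_decode($rawJson, true, 512, JSON_THROW_ON_ERROR);
    } catch (JsonException) {
        return $fallback;
    }

    if (!is_array($data) || !is_string($data['intent'] ?? null)) {
        return $fallback;
    }

    // tryFrom() returns null for labels outside the allowed set.
    $intent = Intent::tryFrom($data['intent']);
    $confidence = $data['confidence'] ?? null;

    $validConfidence = (is_int($confidence) || is_float($confidence))
        && $confidence >= 0
        && $confidence <= 1;

    if ($intent === null || !$validConfidence) {
        return $fallback;
    }

    return ['intent' => $intent, 'confidence' => (float) $confidence];
}
```

A model that invents a new label or drops a field never reaches a reply flow; it lands in the human queue with the raw mail intact.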

The order resolver is the unglamorous part nobody writes a thread about. It tries the order number from the subject, then the customer email, then the last six digits in the body, then a fuzzy match on the postcode plus delivery date. About 91% of mails resolve to a single order on the first attempt. The rest go to the human queue with a note that says which candidates the resolver found.
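The fallback chain described above can be sketched as an ordered list of lookup strategies, where the first strategy that yields exactly one order wins. Everything below is illustrative: the strategy names, the nine-digit order number pattern, and the in-memory `$ordersByEmail` lookup are assumptions, and only two of the four steps are shown.

```php
<?php
declare(strict_types=1);

/**
 * Try each strategy in order; stop at the first one that finds exactly one order.
 *
 * @param array<string, mixed> $mail
 * @param array<string, callable> $strategies name => fn(array $mail): list<string>
 * @return array{orderId: ?string, via: ?string, candidates: list<string>}
 */
function resolveOrder(array $mail, array $strategies): array
{
    $candidates = [];

    foreach ($strategies as $name => $strategy) {
        $found = array_values(array_unique($strategy($mail)));

        if (count($found) === 1) {
            return ['orderId' => $found[0], 'via' => $name, 'candidates' => $found];
        }

        // Zero or several hits: keep them for the human-queue note, try the next step.
        $candidates = array_values(array_unique([...$candidates, ...$found]));
    }

    return ['orderId' => null, 'via' => null, 'candidates' => $candidates];
}

// Hypothetical data and strategies, for illustration only.
$ordersByEmail = ['anna@example.nl' => ['100004711', '100004802']];

$strategies = [
    'subject_order_number' => function (array $mail): array {
        // Assumes nine-digit order numbers; adjust to the shop's real format.
        preg_match_all('/\b\d{9}\b/', $mail['subject'], $m);
        return $m[0];
    },
    'customer_email' => fn (array $mail): array => $ordersByEmail[$mail['from']] ?? [],
];
```

When no step narrows the mail down to a single order, the returned `candidates` list is what goes into the note for the human queue.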

The refund threshold gate

The founders had one non-negotiable rule. The agent can refund a missing pack of butter on its own. It cannot refund a weekly grocery shop. The line they wanted was €40.

That single number changed the shape of the whole system. Every refund flow has two branches. Under threshold, auto-issue and send the apology. Over threshold, build the case file and park it for the klantenservice-lead (the customer service lead). The lead sees a one-screen summary with the customer history, the disputed order, the calculated refund, the suggested reply, and one button.

// app/Service/RefundDecision.php
public function decide(Order $order, Money $proposed): RefundDecision
{
    // Amounts are in cents: EUR(4000) is the €40 line the founders set.
    if ($proposed->greaterThan(Money::EUR(4000))) {
        return RefundDecision::queueForLead(
            reason: 'over_threshold',
            evidence: $this->evidence->build($order),
        );
    }

    // Second gate: more than €80 already refunded to this customer in the last 30 days.
    if ($this->history->refundsInLast30Days($order->customerId())->greaterThan(Money::EUR(8000))) {
        return RefundDecision::queueForLead(
            reason: 'repeat_claimant',
            evidence: $this->evidence->build($order),
        );
    }

    // Under both thresholds: refund without waiting for the lead.
    return RefundDecision::auto($proposed);
}

Two thresholds, not one. The second one matters more than the merchant expected. About 0.6% of customers were filing a complaint on roughly every order. Once we surfaced that to the lead, six accounts got a polite phone call and the refund volume on those accounts fell by 84% the next month. The agent did not catch fraud. The agent surfaced the pattern, a human read it, a human made the call.
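To make the second threshold concrete, here is a small sketch of the rolling 30-day total behind a repeat-claimant check, using plain integer cents instead of a Money object. The function names and the array shape are hypothetical; the €40 and €80 limits are the ones from the text.

```php
<?php
declare(strict_types=1);

/**
 * Sum a customer's refunds issued in the 30 days up to $now.
 * Amounts are integer euro cents to avoid float rounding.
 *
 * @param list<array{issuedAt: DateTimeImmutable, cents: int}> $refunds
 */
function refundedCentsInLast30Days(array $refunds, DateTimeImmutable $now): int
{
    $cutoff = $now->sub(new DateInterval('P30D'));
    $total = 0;

    foreach ($refunds as $refund) {
        if ($refund['issuedAt'] >= $cutoff && $refund['issuedAt'] <= $now) {
            $total += $refund['cents'];
        }
    }

    return $total;
}

/** True when either gate sends the case to the lead. */
function needsLeadReview(int $proposedCents, int $recentCents): bool
{
    return $proposedCents > 4000 || $recentCents > 8000;
}
```

Both comparisons are strict, so a refund of exactly €40 still goes out automatically, matching the `greaterThan` checks in the service above.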

The four-eyes gate for allergy claims

This is the part of the playbook we are most proud of, and the part with the least code. Any mail the classifier tags as Intent::Allergy is locked out of the auto-send pipeline. Full stop. The mail goes into a separate queue that requires two named humans to sign off before the SMTP relay will accept the outbound reply.

The first human is the service agent who wrote the draft. The second is the kwaliteitsborging (quality assurance) lead who controls the supplier sheet. The relay only releases the message if both signoffs are present and the message id matches the case id. If a service agent tries to bypass the queue by replying from a personal mailbox, the outbound relay rejects it, because every service mailbox routes through the relay.

// app/Mail/RelayGuard.php
public function shouldSend(OutboundMail $mail): RelayDecision
{
    // Find the case this reply belongs to via its In-Reply-To header.
    $case = $this->cases->findByMessageId($mail->inReplyTo());

    // Allergy cases need both signoffs before the relay releases the reply.
    if ($case?->intent === Intent::Allergy) {
        if (!$case->hasSignoff(Role::ServiceAgent)) {
            return RelayDecision::reject('allergy_needs_agent_signoff');
        }
        if (!$case->hasSignoff(Role::QualityLead)) {
            return RelayDecision::reject('allergy_needs_quality_signoff');
        }
    }

    // Mail that is not tied to an allergy case passes this guard.
    return RelayDecision::accept();
}

The gate sits in front of the SMTP relay, not in front of the email client. That detail matters. A gate inside the email client can be skipped by switching clients. A gate at the relay cannot be skipped without leaving the building.

Takeaway

The cheapest place to enforce a policy is the layer no human can route around. For email, that is the SMTP relay, not the inbox.

Tone and the excuusbrief problem

An auto-written apology in Dutch goes wrong in two specific ways. First, it sounds like a press release. Second, it uses words no Haarlemmer would use in a real conversation. We spent two days reading actual replies the lead service agent had written over the past year and built a tone profile from her drafts.

The rules ended up boring and effective. Address the customer by first name if the order has one. Mention the actual product that went wrong, by name, not "het artikel" ("the item"). Mention the actual driver's first name if it is a delivery complaint and the driver is the same person who normally delivers that postcode. End with a sentence that names the next concrete thing that happens, not "wij hopen u hiermee voldoende te hebben geïnformeerd" ("we hope this has informed you sufficiently").

The tone tests live in CI. Every change to the prompt runs against a golden set of fifty real customer mails, and a separate evaluator checks for the banned phrases. If the evaluator flags more than two in the set, the deploy fails.
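A banned-phrase check of this kind can be very small. The sketch below assumes "flags more than two" means more than two drafts in the golden set contain a banned phrase; the function names are hypothetical and the phrase list is just the two examples from the rules above.

```php
<?php
declare(strict_types=1);

/**
 * Count drafts that contain at least one banned phrase.
 * stripos() folds ASCII case only, which is enough for these phrases;
 * a production check might prefer mb_stripos().
 *
 * @param list<string> $drafts
 * @param list<string> $bannedPhrases
 */
function countFlaggedDrafts(array $drafts, array $bannedPhrases): int
{
    $flagged = 0;

    foreach ($drafts as $draft) {
        foreach ($bannedPhrases as $phrase) {
            if (stripos($draft, $phrase) !== false) {
                $flagged++;
                break; // one hit is enough to flag this draft
            }
        }
    }

    return $flagged;
}

/** The deploy gate: pass only if at most $maxFlagged drafts are flagged. */
function toneGatePasses(array $drafts, array $bannedPhrases, int $maxFlagged = 2): bool
{
    return countFlaggedDrafts($drafts, $bannedPhrases) <= $maxFlagged;
}
```

In CI this would run over the drafts generated for the fifty golden mails and exit non-zero when `toneGatePasses()` returns false.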

The audit trail nobody asks about until they need it

Every inbound mail, every classification result, every model call, every signoff, every outbound mail, every relay decision goes into an append-only log with a hash chain. The hash chain is overkill for a grocer. The append-only log is not. When the NVWA called about an allergy incident in February, the merchant pulled the full timeline of that case in eleven minutes. The case file showed who saw the mail, what they decided, what supplier batch was involved, and what reply went out. That call could have lasted hours.
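For readers who have not built one, a hash chain over an append-only log fits in a few lines. This is an in-memory sketch with hypothetical function names and event fields; a real log would live in a table or file the application can only append to.

```php
<?php
declare(strict_types=1);

/**
 * Append an event. Each entry's hash covers the previous entry's hash,
 * so editing or removing an earlier entry breaks every hash after it.
 */
function appendEntry(array $log, array $event): array
{
    $prevHash = $log === [] ? str_repeat('0', 64) : $log[array_key_last($log)]['hash'];
    $payload = json_encode($event, JSON_THROW_ON_ERROR | JSON_UNESCAPED_UNICODE);

    $log[] = [
        'prev'    => $prevHash,
        'payload' => $payload,
        'hash'    => hash('sha256', $prevHash . $payload),
    ];

    return $log;
}

/** Recompute every hash from the start and compare with what was stored. */
function verifyChain(array $log): bool
{
    $prevHash = str_repeat('0', 64);

    foreach ($log as $entry) {
        $expected = hash('sha256', $prevHash . $entry['payload']);

        if ($entry['prev'] !== $prevHash || !hash_equals($expected, $entry['hash'])) {
            return false;
        }

        $prevHash = $entry['hash'];
    }

    return true;
}
```

Pulling a case timeline is then a filter over the payloads, and `verifyChain()` is what lets the merchant say the timeline was not edited after the fact.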

For the bounce side, the relay parses delivery status notifications (DSNs) as defined in RFC 3464 and records hard bounces on the customer record. Soft bounces retry on a backoff schedule. Any address that bounces three times gets a flag on the customer profile so the next agent who picks up that account knows the address is bad.
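A rough sketch of the bounce logic, assuming the hard/soft split is read from the per-recipient `Status:` field of the DSN (class 5 is a permanent failure, class 4 a temporary one). The function names, the 15-minute base delay, and the 12-hour cap are assumptions; only the three-bounce flag comes from the text.

```php
<?php
declare(strict_types=1);

/**
 * Classify a DSN by its Status field.
 * Returns 'hard' for 5.x.x, 'soft' for 4.x.x, 'none' otherwise.
 */
function classifyBounce(string $dsnPart): string
{
    if (!preg_match('/^Status:\s*([245])\.\d{1,3}\.\d{1,3}/mi', $dsnPart, $m)) {
        return 'none';
    }

    return match ($m[1]) {
        '5' => 'hard',
        '4' => 'soft',
        default => 'none',
    };
}

/** Exponential backoff for soft-bounce retries. Base and cap are assumptions. */
function retryDelayMinutes(int $attempt, int $baseMinutes = 15, int $capMinutes = 720): int
{
    return min($capMinutes, $baseMinutes * (2 ** max(0, $attempt - 1)));
}

/** Three bounces put a flag on the customer profile. */
function shouldFlagAddress(int $bounceCount): bool
{
    return $bounceCount >= 3;
}
```

A full implementation would walk the MIME parts of the bounce and read the `message/delivery-status` part; the regular expression here only shows which field drives the decision.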

What the numbers did over twelve weeks

The agent went live in waves. First only the missed-window intent, then wrong-or-missing, then quality, then refund chase. Allergy never went on auto-send and never will.

Twelve weeks in, of the 4,120 weekly mails, the agent now handles 2,840 (about 69%) end to end with no human in the loop. Roughly 980 are drafted by the agent and approved by a service agent with at most one tweak. The remaining 300 are still fully human. The two service agents went from drowning to having time to call the repeat-claimant accounts and to walk the floor with the pickers once a week. The lead has thirty to forty refunds to review each morning instead of seven hundred unread mails.

The Magento 1.9 store is still there. The Symfony WMS is still there. Nobody touched a line of either.

What to take from this

If you have a legacy stack and a drowning inbox, do not start with a migration plan. Start with a classifier and a queue. Read from your old systems. Do not write to them. Put your gates in front of the relay, not in front of the humans. Set one money threshold and one repeat-claimant threshold and let the lead see both. And whatever you do, route the high-risk intents through two named humans before a single byte reaches an outbound socket.

When we built this for the Haarlem grocer, the hard part was not the model and not the prompt. The hard part was the order resolver and the relay guard. The same patterns sit underneath every process automation project we ship, whether the inbox is a supermarket or a SaaS support queue.

Five-minute audit for your own inbox: pull last week's mails, cluster them by hand into six buckets, and circle the bucket that holds the most risk. That bucket gets the four-eyes gate. The rest can wait.

Key takeaway

Put your policy gate at the SMTP relay, not the email client. It is the only layer in the stack your team cannot route around.

FAQ

Why not migrate off Magento 1.9 first?

Because migration is a years-long project and the inbox is a daily fire. Solve the daily fire with a side process that only reads from the legacy stack. The migration becomes easier later because you now have clean classification data.

Can the agent issue refunds directly in Magento?

Technically it could, in practice it does not. The agent never writes to Magento. Under €40 it decides the refund and sends the apology on its own, but a human still clicks the button that calls Magento. The audit trail is worth more than the saved click.

What happens to allergy mails when both signers are out?

The mail stays in the queue. We attach a 24-hour SLA timer and a founder is the third name on the on-call list. The agent never sends an allergy reply on its own under any circumstance.

How do you keep the tone from sounding generated?

Build a tone profile from real replies your best agent wrote, then run every prompt change against a golden set of fifty customer mails in CI. If the banned-phrase evaluator flags more than two, the deploy fails.

Does this approach work for non-Dutch inboxes?

Yes. The shape of the playbook is language-agnostic. The tone profile and the classifier prompts change per language, but the threshold gate, the four-eyes gate, and the relay guard are the same.

email automation · ai agents · magento · legacy sites · workflow · operations
