Email automation

Email triage agent: how we killed 200 Outlook rules

The shared mailbox had 2,431 unread, 47 folders and a rule set nobody dared touch. Two sprints later, one agent reads every message and routes it first.

Jacob Molkenboer · Founder · A Brand New Company · 3 Jun 2026 · 9 min

It was 23:18 on a Tuesday in January when Lieke, the operations lead at a 90-person accountancy north of Eindhoven, stopped trying to fix the Outlook rules.

She had been at it for two hours. A junior had renamed a client folder the week before. The rule that depended on the folder name had silently failed. Three days of supplier invoices had collected under a label called "Diverse". The shared mailbox showed 2,431 unread.

The rule set was eight years old. Two hundred and forty-one entries. Folders referencing people who had left in 2021. A nested rule that copied anything from a specific bank to four mailboxes, one of which had been archived. Nobody touched it because nobody understood it end to end.

Two sprints later, the rules were gone. One agent reads every message arriving at intake@, classifies it, posts it into the right Teams channel, drafts a reply, and watches the SLA clock. Here is what we did, in order.

The 200-rule baseline

Outlook rules are a great feature for one person managing fifty emails a day. They fall over the moment you have a shared mailbox, ten people who can each create rules, and eight years of staff turnover.

The firm had hit three walls at once. The client-side rule set was bumping the 256 KB rules quota in Exchange, which silently disables newer rules. Half the rules were order-dependent in ways nobody had documented. And because the folder structure carried institutional knowledge ("this client uses two domains", "this one wants their tax return in CC"), nobody was willing to delete a rule that might still matter.

We did not delete anything in week one. We turned on a parallel read.

Two weeks of mailbox traffic, categorized by hand

Before writing any routing code, we pulled fourteen days of mail through the Microsoft Graph API and read every message. Not the agent. Us. With Lieke.
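A minimal sketch of what such a fourteen-day pull could look like, not the code used at the firm. It assumes the Graph v1.0 list-messages endpoint with `$filter`, `$select` and `$top`, and follows `@odata.nextLink` until the pages run out. The `getJson` helper, the type names and the selected fields are hypothetical; the caller is expected to supply an authenticated GET.

```typescript
// Hypothetical sketch: page through a window of shared-mailbox mail for manual review.
type MailSummary = {
  id: string;
  subject: string | null;
  receivedDateTime: string;
  from?: { emailAddress?: { address?: string } };
};

type MessagePage = { value: MailSummary[]; "@odata.nextLink"?: string };

// Assumption: the caller wraps an authenticated HTTP GET that returns parsed JSON.
type GetJson = (url: string) => Promise<MessagePage>;

const GRAPH_V1 = "https://graph.microsoft.com/v1.0";

// Build the first page URL: everything received on or after `since`.
function firstPageUrl(mailbox: string, since: Date): string {
  const filter = `receivedDateTime ge ${since.toISOString()}`;
  return (
    `${GRAPH_V1}/users/${encodeURIComponent(mailbox)}/messages` +
    `?$filter=${encodeURIComponent(filter)}` +
    `&$select=id,subject,from,receivedDateTime` +
    `&$top=100`
  );
}

// Follow @odata.nextLink until Graph stops returning one.
async function pullWindow(
  getJson: GetJson,
  mailbox: string,
  since: Date,
): Promise<MailSummary[]> {
  const all: MailSummary[] = [];
  let url: string | undefined = firstPageUrl(mailbox, since);
  while (url) {
    const page: MessagePage = await getJson(url);
    all.push(...page.value);
    url = page["@odata.nextLink"];
  }
  return all;
}
```

Selecting only a handful of fields keeps the export small enough to paste into a spreadsheet, which is where the hand categorization actually happens.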

The forty-seven folders collapsed into thirteen real categories:

  • Client intake (new engagement requests)
  • Supplier invoices addressed to the firm itself
  • Client invoices forwarded for processing
  • Year-end document deliveries
  • Payroll questions
  • VAT and quarterly filings
  • Audit follow-ups
  • Bank statements and automated feeds
  • Government correspondence (Belastingdienst, KvK)
  • Calendar and meeting noise
  • Internal CC from the firm's own accounting software
  • Spam and marketing
  • Anything mentioning curator, faillissement, or deurwaarder, always escalated

The last category had three messages in two weeks. Two of them were sitting unread on day three.

The triage agent in one paragraph

A Microsoft Graph change-notification subscription sits on intake@. Every new message fires a webhook into a small EU-hosted service. The service pulls the full message, asks an LLM to classify it against the thirteen categories above, decides what to do with it, then writes back to Graph and posts a card into Teams. Total budget per message is under two seconds, most of it the model call.
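The receiving end of that webhook can be sketched as a small pure function. This is an illustration under assumptions, not the shipped service: it covers the two things Graph expects from a notification endpoint, echoing the `validationToken` as plain text when the subscription is created, and acknowledging notifications quickly while the real work happens elsewhere. The `enqueue` callback, the `WebhookReply` shape and the function name are hypothetical.

```typescript
// Hypothetical sketch of a Graph notification endpoint, framework-agnostic.
type ChangeNotification = {
  subscriptionId: string;
  clientState?: string;
  changeType: string;
  resource: string;
  resourceData?: { id?: string };
};

type WebhookReply = { status: number; contentType?: string; body?: string };

function handleGraphWebhook(
  validationToken: string | null, // already URL-decoded query parameter, if present
  payload: { value?: ChangeNotification[] } | null,
  expectedClientState: string, // the secret we set when creating the subscription
  enqueue: (messageId: string) => void, // assumption: hands off to a worker queue
): WebhookReply {
  // Subscription handshake: echo the token back as text/plain.
  if (validationToken !== null) {
    return { status: 200, contentType: "text/plain", body: validationToken };
  }

  for (const n of payload?.value ?? []) {
    // Drop anything that does not carry our clientState.
    if (n.clientState !== expectedClientState) continue;
    const id = n.resourceData?.id;
    if (id) enqueue(id);
  }

  // Acknowledge immediately; classification runs off the queue.
  return { status: 202 };
}
```

Keeping the handler free of model calls is what protects the two-second budget: the webhook only validates and queues, and the slow part runs after the response has gone out.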

The webhook setup itself is well documented and you can read it in an afternoon. The interesting bit is the renewal dance: Graph mail subscriptions expire after roughly seventy hours, and you need a job that re-subscribes before they lapse.

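// Renew a Graph mail subscription before it reaches the 4230-minute ceiling.
// Assumes `graph` is an authenticated Microsoft Graph client and `Subscription`
// is the Graph subscription type, both set up elsewhere.
const renewIfExpiring = async (sub: Subscription) => {
  const minutesLeft =
    (new Date(sub.expirationDateTime).getTime() - Date.now()) / 60_000;
  // More than an hour left: nothing to do on this run.
  if (minutesLeft > 60) return;

  // Ask for 4200 minutes, which leaves headroom under the ceiling.
  await graph
    .api(`/subscriptions/${sub.id}`)
    .patch({
      expirationDateTime: new Date(
        Date.now() + 4200 * 60 * 1000,
      ).toISOString(),
    });
};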

Read the Microsoft Graph change notifications guide once, set a cron, and forget about it.

The classifier prompt that actually shipped

The prompt is unglamorous on purpose. No few-shot examples, no chain-of-thought theatre. The model is asked for structured output and nothing else. Confidence is a number the agent uses to decide whether to act, not an opinion to debate.

You are the intake classifier for an accountancy firm.
Read the email below. Output JSON only, with these keys:

  category:        one of [intake, supplier_invoice, client_invoice,
                   year_end, payroll, vat, audit, bank_feed, govt,
                   calendar, internal_cc, spam, escalate]
  confidence:      number between 0 and 1
  escalate:        true if the message mentions curator, faillissement,
                   deurwaarder, dwangbevel, or a named regulator
  client_id:       CRM id if the sender or domain matches, else null
  suggested_reply: a Dutch draft in formal-friendly tone, or null
  reasoning:       one sentence, max 25 words

Rules:
- If confidence < 0.85, suggested_reply must be null.
- If escalate is true, suggested_reply must be null.
- Never invent a deadline. If no date is present, omit it.

Everything downstream keys off that JSON. No prose response is ever parsed.
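Because everything keys off that JSON, it is worth validating it before acting on it. The sketch below is one way to do that, under assumptions: it rejects anything that is not JSON or names an unknown category, and it re-applies the two reply rules in code rather than trusting the model to follow them. The function name, the `TriageResult` type and the choice to treat `category: "escalate"` as an escalation are ours for illustration.

```typescript
// Hypothetical sketch: validate the classifier's JSON before anything acts on it.
const CATEGORIES = [
  "intake", "supplier_invoice", "client_invoice", "year_end", "payroll",
  "vat", "audit", "bank_feed", "govt", "calendar", "internal_cc", "spam",
  "escalate",
] as const;
type Category = (typeof CATEGORIES)[number];

type TriageResult = {
  category: Category;
  confidence: number;
  escalate: boolean;
  client_id: string | null;
  suggested_reply: string | null;
  reasoning: string;
};

const MIN_CONFIDENCE = 0.85; // threshold taken from the prompt rules

function parseTriage(raw: string): TriageResult | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // prose or broken JSON: treat as a failed classification
  }
  if (typeof data !== "object" || data === null) return null;
  const d = data as Record<string, unknown>;

  const category = CATEGORIES.find((c) => c === d.category);
  if (category === undefined) return null;

  const confidence = d.confidence;
  if (typeof confidence !== "number" || !(confidence >= 0 && confidence <= 1)) {
    return null;
  }

  // Assumption: the "escalate" category implies escalation even if the flag is missing.
  const escalate = d.escalate === true || category === "escalate";
  const draft = typeof d.suggested_reply === "string" ? d.suggested_reply : null;

  return {
    category,
    confidence,
    escalate,
    client_id: typeof d.client_id === "string" ? d.client_id : null,
    // Enforce the prompt rules in code: no draft when unsure or escalated.
    suggested_reply: escalate || confidence < MIN_CONFIDENCE ? null : draft,
    reasoning: typeof d.reasoning === "string" ? d.reasoning : "",
  };
}
```

A `null` result is a signal in its own right: a message the classifier could not describe in valid JSON is a reasonable candidate for the human queue.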

Routing decisions and the human-in-loop list

Most categories auto-route. A supplier invoice goes to the finance Teams channel with the attachment extracted and a draft entry for the accounting system. A bank statement is parsed and dropped into the shared drive. A government letter from the Belastingdienst is posted to the partner channel with the deadline pulled out of the body.

Four cases are explicitly never auto-handled:

  • Anything matching the escalation keyword list above.
  • Audit follow-ups from external auditors.
  • Any message where classifier confidence falls below 0.85.
  • Any sender flagged as "sensitive" in the CRM (litigation, ongoing dispute, recently switched advisors).

Those four cases route to a named partner with a Teams ping and the classifier's reasoning attached. The agent does not draft a reply for them. It does not even suggest one.
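The four never-auto-handled cases reduce to a short guard that runs before any auto-routing. A minimal sketch, assuming the classifier output has already been validated and the CRM lookup has produced a `senderSensitive` flag; the type names, the function name and the reason strings are hypothetical.

```typescript
// Hypothetical sketch: decide whether a message may be auto-routed at all.
type RoutingInput = {
  category: string;
  confidence: number;
  escalate: boolean;
  senderSensitive: boolean; // assumption: looked up in the CRM beforehand
};

type RouteDecision =
  | { kind: "auto" }
  | { kind: "partner"; reason: string };

const AUTO_THRESHOLD = 0.85;

function decideRoute(input: RoutingInput): RouteDecision {
  // Order matters only for the reason shown in the Teams ping.
  if (input.escalate) {
    return { kind: "partner", reason: "escalation keyword" };
  }
  if (input.senderSensitive) {
    return { kind: "partner", reason: "sender flagged sensitive in CRM" };
  }
  if (input.category === "audit") {
    return { kind: "partner", reason: "audit follow-up" };
  }
  if (input.confidence < AUTO_THRESHOLD) {
    return { kind: "partner", reason: "low classifier confidence" };
  }
  return { kind: "auto" };
}
```

Writing the exceptions as one function keeps the human-in-the-loop list reviewable in a single place, which matters more than the few lines of code it takes.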

Reply drafts, never auto-send

The single decision that bought us the firm's trust was that the agent drafts, the human sends. Always.

The classifier output includes a suggested reply, written in Dutch in the firm's house style, with the client's name pulled from the CRM and the relevant deadline interpolated. The draft lands in the responder's Outlook as a real draft, ready to send. They read it, edit if needed, click send.
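One way to get a real draft into Outlook is Graph's `createReply` action, which creates an unsent reply in the mailbox's Drafts folder. The sketch below only builds the request and deliberately has no send step. It is an illustration under assumptions: the helper name and the `DraftRequest` type are hypothetical, and passing the text as `comment` is the simplest of the options the action accepts.

```typescript
// Hypothetical sketch: turn a suggested reply into a createReply request.
type DraftRequest = {
  method: "POST";
  url: string;
  body: { comment: string };
};

const GRAPH_ROOT = "https://graph.microsoft.com/v1.0";

function buildDraftReply(
  mailbox: string,
  messageId: string,
  suggestedReply: string | null,
): DraftRequest | null {
  // No suggestion (low confidence or escalated): create nothing.
  if (suggestedReply === null || suggestedReply.trim() === "") return null;

  return {
    method: "POST",
    url:
      `${GRAPH_ROOT}/users/${encodeURIComponent(mailbox)}` +
      `/messages/${encodeURIComponent(messageId)}/createReply`,
    // createReply only creates the draft; a human still has to press send.
    body: { comment: suggestedReply },
  };
}
```

Nothing in this path calls a send endpoint, so "the human sends" is a property of the code and not just a policy.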

We measured this over the first two weeks. 71% of drafts went out unedited. 22% had small tweaks (a sentence added, a tone shift). 7% were rewritten or discarded. That last number stayed flat as volume grew, which told us the agent had not learned to fake confidence in cases it should have escalated.

Three gotchas worth knowing about

The first one we expected. Outlook rules were institutional memory. Before we deleted any of the 241 rules, we interviewed the four people who had created the most of them, exported the rule set, and asked the LLM to translate each rule into a sentence of English. Then we matched the sentences against our thirteen-category list and flagged the ones that did not map cleanly. Those were the rules that encoded actual knowledge ("this domain belongs to client X's holding company"). We pushed that knowledge into the CRM, where it belonged, before we touched anything in Outlook.

The second was BCC traffic from the firm's own accounting software. It looked like real client mail to the classifier, because in a sense it was. We added a known-sender check that runs before the classifier and routes automated internal mail straight to a dead-letter folder, so it never reaches the model.

Warning

Out-of-office replies will eat your agent alive. A client OOO bounces into the shared mailbox, the classifier sees a Dutch sentence about being unavailable until Monday, and unless you check the Auto-Submitted header it gets treated as a new intake. We saw three loops in week one. Always honour the Auto-Submitted header (RFC 3834) before classification runs.
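The check itself is small. A minimal sketch, assuming the message was fetched with its `internetMessageHeaders` selected so the raw headers arrive as name and value pairs; the type and function names are hypothetical. Under RFC 3834, any Auto-Submitted value other than `no` marks the message as automatic.

```typescript
// Hypothetical sketch: detect auto-generated mail before the classifier runs.
type GraphHeader = { name: string; value: string };

function isAutoSubmitted(headers: GraphHeader[]): boolean {
  for (const h of headers) {
    // Header names are case-insensitive.
    if (h.name.toLowerCase() !== "auto-submitted") continue;

    // The value may carry parameters after a semicolon; only the keyword matters.
    const keyword = (h.value.split(";")[0] ?? "").trim().toLowerCase();

    // "no" means a human sent it; auto-replied, auto-generated, etc. do not.
    if (keyword !== "no") return true;
  }
  // No header at all: assume a normal message.
  return false;
}
```

Running this before the model call also saves the classification cost on mail that should never have been classified.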

What changed for ninety people

The numbers we tracked are unglamorous, which is why we trust them.

Average time from message arrival to a human eye on it dropped from around six minutes during business hours (and indefinite outside them) to under thirty seconds, twenty-four hours a day. The "wie pakt deze?" Teams messages, where someone copy-pasted a forwarded email asking who handled it, dropped from a baseline of about forty a day to under five. The shared mailbox folder structure collapsed from forty-seven folders to nine archival ones.

The change Lieke pointed to was harder to measure. At 22:00 on a weekday, nobody was scrolling the shared mailbox anymore. The partners had stopped sweeping it as a habit, because the agent had already posted anything they needed to see. That meant they were also no longer accidentally answering things at midnight, which the operations team had been quietly worried about for years.

The smallest version you could ship next week

You do not need the full pipeline to get most of the value. Subscribe to one shared mailbox via Microsoft Graph. Pipe each new message through one classifier call with a list of five categories you can name off the top of your head. Post the result, with the original message, into a single Teams or Slack channel with a button that says "I've got this".

Run that for a week. You will learn which five categories were wrong, which folder structure was load-bearing, and whether your team will actually trust a bot to triage their inbox. Then you can think about drafts, SLAs, and routing.

When we built the triage agent for the firm above, the thing we ran into was that the rules were not rules at all. They were institutional memory written into folder names and rule conditions. We ended up extracting that knowledge into the CRM before any classifier saw the mailbox, and that is the pattern we now reuse whenever we ship AI agents against a legacy inbox.

If you do one thing this week, open your shared mailbox, sort by folder size, and ask the three biggest folders what they actually contain. That conversation is where the real work starts.

Key takeaway

Outlook rules are institutional memory in disguise. Read the mail first, write the agent second, and put the knowledge in the CRM where it belongs.

FAQ

Can the agent send replies on its own?

We always keep a human on the send button. Drafts land directly in the responder's Outlook; they review and click send. In the first two weeks of measurement, 71% of drafts went out unedited.

What about GDPR and the Dutch AVG when client mail runs through an LLM?

The classifier runs on EU infrastructure, message bodies are never logged beyond a hash, and the model provider has a signed DPA. Sensitive client flags in the CRM block the LLM call entirely.

How often do Microsoft Graph subscriptions expire?

Mail change-notification subscriptions expire after about 70 hours. Run a renewal job more often than your renewal window, for example every 30 minutes to extend any subscription with under an hour left, so a single late or failed run does not let one lapse.

Did you delete all 241 Outlook rules at once?

No. We ran the agent in parallel for two weeks, compared its routing against what the rules would have done, then switched the rules off once the agent matched on the categories that mattered.

email automation · ai agents · automation · case study · workflow · operations
