On-call agents vs PagerDuty: a scoring method for small SaaS

It's 02:14, a Stripe webhook is replaying 600 events a minute, and your two engineers are asleep. Should the next page wake them, an agent, or both?

Jacob Molkenboer · Founder · A Brand New Company · 6 Mar 2026 · 9 min

It's 02:14 on a Tuesday. A Stripe webhook is replaying 600 events a minute against your API. Your two engineers are asleep — one is in Utrecht, one is on holiday in Chiang Mai. You have 240 alerts a month and three people who could plausibly own them. You also have a SOC 2 Type II audit in November and a co-founder who keeps forwarding you HN threads about agents that took down a customer's web server at 03:00 because nobody read the prompt.

This post is the method we use, at ABN, when a Dutch SaaS founder under €7M ARR asks us whether to outsource the next twelve months of on-call to an agent, a human rotation, or something in between. It is not a pitch for agents. Two of the last four times we ran this scoring exercise, the answer came back: hire a third engineer.

The three shapes on the table

You have, broadly, three options. Most write-ups treat them as ideologically opposed; they aren't.

Shape A — pure agent. A Claude-driven worker subscribed to your alert pipeline. It reads the alert, runs read-only diagnostics, classifies, attempts the fix from a runbook, and pages a human only on escalation. The human is asleep by design.

Shape B — three-person rotation. The classic. PagerDuty (or Grafana OnCall, or Opsgenie) pages a human, the human runs the runbook in their head plus a terminal. No agent in the loop.

Shape C — agent-first with human-approval gate. The agent does the diagnosis and proposes the fix. A human approves the destructive bit (the restart, the rollback, the WAF rule). If no human ACKs inside four minutes, the agent either holds or escalates depending on the class of incident.

Founders want a single answer. We score the three against three axes instead.

Three axes that actually matter for sub-€7M SaaS

We ignore vanity metrics: MTTR distributions you can't compute, "incident velocity", anything ending in "score". For a sub-€7M Dutch SaaS, three things move the decision.

1. Per-incident MTTA, weighted

Mean Time To Acknowledge: the gap between the alert firing and a human or agent saying "I have it". PagerDuty's reference definition is the one most auditors use.

The weighting matters. A 90-second MTTA on a "queue depth at 80%" is irrelevant; a 90-second MTTA on "payments are 500ing" might be worth €30k. We weight each alert class by what it costs you per minute of unacknowledged time. Roughly:

ALERT_WEIGHTS = {
    "payments_5xx": 800,        # €/min — Mollie, Stripe, Adyen failures
    "auth_500":      300,
    "search_down":   120,
    "queue_depth":    10,
    "disk_warn":       5,
}

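def weighted_mtta(incidents):
    """Average euro cost of unacknowledged time per incident."""
    if not incidents:
        return 0.0  # empty window: nothing to average, avoid dividing by zero
    total_cost = sum(
        i.mtta_seconds / 60 * ALERT_WEIGHTS[i.class_]  # minutes unacked × €/min
        for i in incidents
    )
    return total_cost / len(incidents)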

Plug in your last 90 days. If the answer is under €40, you don't have an on-call problem; you have a CFO who reads metrics out of context. If it's over €400, you have one regardless of which shape you choose.
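
As a worked example of the calculation above, here is a minimal sketch run on three invented incidents. The Incident dataclass and the sample values are assumptions for illustration; only the weights and the formula come from this post.

```python
from dataclasses import dataclass

# €/min, copied from the weights above
ALERT_WEIGHTS = {"payments_5xx": 800, "queue_depth": 10, "disk_warn": 5}

@dataclass
class Incident:
    class_: str          # alert class, must be a key of ALERT_WEIGHTS
    mtta_seconds: float  # seconds between alert firing and acknowledgement

def weighted_mtta(incidents):
    """Average euro cost of unacknowledged time per incident."""
    if not incidents:
        return 0.0
    total = sum(i.mtta_seconds / 60 * ALERT_WEIGHTS[i.class_] for i in incidents)
    return total / len(incidents)

# Invented sample data, not real incidents
sample = [
    Incident("payments_5xx", 90),   # 1.5 min * 800 = 1200
    Incident("queue_depth", 90),    # 1.5 min * 10  = 15
    Incident("disk_warn", 600),     # 10 min  * 5   = 50
]
print(round(weighted_mtta(sample), 2))  # 421.67
```

One slow acknowledgement on a payments alert is enough to push the average past €400, even though the other two incidents are nearly free. That is the point of weighting: the unweighted MTTA of this sample would hide it.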

2. Post-mortem defensibility for a SOC 2 auditor

This is the axis founders forget until July of the audit year. Your auditor is going to ask, for any incident touching customer data: who detected, who diagnosed, who decided to take the fixing action, and where is the trail. The AICPA Trust Services Criteria are not specific about whether "who" has to be a human (CC7.3 and CC7.4 care about how security events are evaluated and responded to, not the species of the responder), but auditors are conservative, and your customer's vendor-risk team is more conservative still.

The defensibility question reduces to one thing: can you, six months later, show a complete chain of alert → diagnosis → decision → action → verification? A pure-human rotation produces this trail by accident in Slack. A pure-agent setup produces a much better trail, if you instrument it. A hybrid produces the cleanest trail of the three, because the human approval step is a literal recorded decision with a name attached.
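
The chain in the paragraph above can be checked mechanically. The sketch below is one way to do it, assuming each incident is stored as a list of events tagged with a stage name; the TrailEvent fields, the actor names, and the timestamps are all hypothetical.

```python
from dataclasses import dataclass

# The five links of the chain, in order
STAGES = ("alert", "diagnosis", "decision", "action", "verification")

@dataclass
class TrailEvent:
    stage: str      # one of STAGES
    actor: str      # a person's name or an agent/model identifier
    timestamp: str  # ISO 8601, UTC
    detail: str

def missing_stages(events):
    """Return the stages with no recorded event, in chain order."""
    seen = {e.stage for e in events}
    return [s for s in STAGES if s not in seen]

# Invented incident: everything is recorded except the verification step
trail = [
    TrailEvent("alert", "alertmanager", "2026-03-03T01:14:00Z", "payments_5xx firing"),
    TrailEvent("diagnosis", "agent", "2026-03-03T01:14:06Z", "webhook replay storm"),
    TrailEvent("decision", "j.devries", "2026-03-03T01:16:40Z", "approved rate limit"),
    TrailEvent("action", "agent", "2026-03-03T01:16:42Z", "applied WAF rule"),
]
print(missing_stages(trail))  # ['verification']
```

Running a check like this when an incident is closed, rather than six months later, is the cheap way to find out which link your current setup tends to drop.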

3. Who owns the runbook when the agent loops on a holiday

Every agent eventually does the wrong thing confidently. Ours have. The honest design question is not "will the agent fail" but "when it fails on 26 December at 14:00 while your only on-call person is in a long lunch with her in-laws, what happens next."

In a pure-human rotation this is moot — the human is the loop. In a pure-agent setup, an unsupervised loop is the worst failure mode you have, because the agent will keep taking actions and burning credibility (and money) faster than a tired human would. In a hybrid, the gate is the answer, but only if the gate has a hard timeout and a fallback owner who is not the agent itself.

Warning

If your runbook says "if no one approves in 4 minutes, the agent decides", you have a pure agent with extra steps. The gate is only meaningful if the no-approval branch escalates to a second human, not to the agent's own judgment.
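
The warning above can be turned into a small check on whatever config describes your gate. This is a sketch under assumptions: the policy keys (approval_timeout_s, on_timeout, fallback_owner) and their values are hypothetical names, not a format from any particular tool.

```python
def gate_problems(policy):
    """Return the reasons this approval gate is not a real gate (empty list if fine)."""
    problems = []
    timeout = policy.get("approval_timeout_s")
    if not isinstance(timeout, (int, float)) or timeout <= 0:
        problems.append("no hard approval timeout")
    if policy.get("on_timeout") != "page_human":
        problems.append("timeout branch does not page a human")
    fallback = policy.get("fallback_owner")
    if not fallback or fallback == "agent":
        problems.append("fallback owner is missing or is the agent itself")
    return problems

# Hypothetical policies for illustration
weak = {"approval_timeout_s": 240, "on_timeout": "agent_decides", "fallback_owner": "agent"}
solid = {"approval_timeout_s": 240, "on_timeout": "page_human", "fallback_owner": "secondary-oncall"}

print(gate_problems(weak))
print(gate_problems(solid))  # []
```

The weak policy has a timeout, which is why it looks like a gate at first glance; it fails on the two checks that decide who acts when nobody answers.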

Scoring the three shapes

Here is how the three options actually score, with the caveats we keep finding in production. We've shipped fourteen agents; four of them touch on-call work in some form.

Shape A: pure agent

MTTA: best on paper. We see weighted MTTA come down 60–80% versus a sleeping human, because the agent acks in about six seconds.

SOC 2 defensibility: good if you log every tool call, prompt, model version, and decision to immutable storage and you keep them for the audit window. Bad if you "just use the API and figure it out later". The auditor wants a replayable trail; you build it now or you build it under a deadline.
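
One common way to make such a log tamper-evident is a hash chain, where each entry's hash covers the previous entry's hash. The sketch below illustrates that technique with the standard library only; it is an assumption about how you might build the trail, not a description of a specific product, and the event fields are invented.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

def _entry_hash(prev_hash, event):
    payload = json.dumps(event, sort_keys=True)  # stable serialisation
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()

def append_entry(log, event):
    """Append an event whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev_hash": prev_hash,
                "hash": _entry_hash(prev_hash, event)})

def verify(log):
    """Recompute every hash; False if an entry was altered or the chain is broken."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True

log = []
append_entry(log, {"type": "prompt", "model": "claude-sonnet-4-5", "text": "diagnose payments_5xx"})
append_entry(log, {"type": "tool_call", "tool": "read_logs", "args": {"service": "api"}})
append_entry(log, {"type": "decision", "proposed_action": "enable rate limit"})
print(verify(log))  # True

log[1]["event"]["tool"] = "restart_service"  # simulate someone editing history
print(verify(log))  # False
```

A hash chain detects edits and reordering, but not someone truncating the tail or replacing the whole log. For the audit window you still want the entries in write-once storage, with the latest hash recorded somewhere the agent cannot write.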

Holiday-loop ownership: this is where pure agents fail in real engagements. There is no human in the loop, so the failure mode is the agent quietly destroying state for an hour while the office is at Sinterklaas. Pure agent is defensible only for read-only and notify-only actions. The moment your runbook touches production state, you want a gate.

Shape B: three-person rotation

MTTA: worst on average — four to eleven minutes for a midnight page is typical, longer if the on-call person has a baby. But the long tail is shorter, because humans triage ambiguity better than current models.

SOC 2 defensibility: fine, with discipline. The trail lives in Slack and the PagerDuty timeline, both of which auditors accept without comment.

Holiday-loop ownership: this is the model's whole point. The human is the loop. The cost is human cost — a three-person rotation at fair compensation in NL is roughly €18k/year in on-call premium alone, plus the silent tax of two people sleeping with their phone face-up.

Shape C: hybrid, agent-first with approval gate

MTTA: nearly as good as pure agent for the ack, slightly worse for action time because the human approval sits in the critical path.

SOC 2 defensibility: the best of the three. Every incident has a machine-generated diagnosis, a human-recorded decision, and an automated action with a hash of the approval. Auditors notice; we've watched a Big Four senior pause and tell the founder it was the cleanest IR log they had seen on the engagement.

Holiday-loop ownership: works only if the no-approval branch escalates to a second human, not to the agent. This is the single most important design decision in the hybrid model.

The decision rule we actually apply

After about a dozen of these reviews, the rule reduces to four questions. Answer them in order.

1. Are more than 30% of your alerts touching production state? If no, ship a pure agent; the upside is real and the downside is bounded. If yes, continue.
2. Do you have SOC 2 (or are you in a sales cycle that requires it)? If yes, skip pure agent. Pick rotation or hybrid.
3. Is your weighted MTTA cost over €200? If yes, hybrid wins on expected value. If no, rotation is fine and probably cheaper than the engineering time to ship the agent.
4. Do you have two people who will reliably ACK within four minutes at 02:00? If no, you don't have a rotation; you have one person and a fiction. Hybrid is the only honest answer.

That's it. The framework runs in five minutes on a whiteboard.
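
For readers who prefer code to a whiteboard, the four questions can be sketched as a function. This is one reading of the rule, not a definitive encoding: the parameter names are hypothetical, and the SOC 2 question does not appear as an input because it only rules out the pure agent, which the last two questions never select anyway.

```python
def choose_shape(prod_state_share, weighted_mtta_eur, reliable_ackers):
    """Return 'A' (pure agent), 'B' (rotation) or 'C' (hybrid).

    prod_state_share:  fraction of alerts that touch production state (0.0 to 1.0)
    weighted_mtta_eur: weighted MTTA cost per incident, in euros
    reliable_ackers:   people who reliably ACK within four minutes at 02:00
    """
    if prod_state_share <= 0.30:   # question 1: mostly read-only or notify-only
        return "A"
    if weighted_mtta_eur > 200:    # question 3: slow acks are expensive
        return "C"
    if reliable_ackers < 2:        # question 4: the rotation is a fiction
        return "C"
    return "B"

print(choose_shape(0.10, 50, 3))   # A
print(choose_shape(0.50, 350, 3))  # C
print(choose_shape(0.50, 120, 1))  # C
print(choose_shape(0.50, 120, 3))  # B
```

Note how narrow the path to a plain rotation is: it needs both a low weighted cost and at least two people who really do pick up at 02:00.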

A minimal hybrid loop, in code

For founders who land on Shape C and want to see what "approval gate" looks like in practice, the loop is shorter than people expect. Below is the shape we use as a starting point, stripped to the essentials and with the vendor specifics removed.

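import anthropic

client = anthropic.Anthropic()
APPROVAL_TIMEOUT_S = 240  # 4 minutes

# READ_ONLY_TOOLS, render, parse_diagnosis, notify_humans,
# post_to_slack_with_buttons, wait_for_decision, execute,
# page_secondary_oncall and log_denied are your own helpers (hypothetical
# names); they are where the vendor specifics live.

def handle_alert(alert):
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        tools=READ_ONLY_TOOLS,   # logs, metrics, traces; never writes
        messages=[{"role": "user", "content": render(alert)}],
    )
    # The SDK returns a plain Message. parse_diagnosis turns its content into
    # an object with .proposed_action and .trace_id. A full version also loops
    # while response.stop_reason == "tool_use", running each read-only tool
    # and sending the result back before parsing.
    diagnosis = parse_diagnosis(response)

    if diagnosis.proposed_action is None:
        return notify_humans(diagnosis, severity="info")

    request_id = post_to_slack_with_buttons(
        channel="#oncall-gate",
        diagnosis=diagnosis,
        actions=["approve", "deny", "page_human"],
    )

    decision = wait_for_decision(request_id, timeout=APPROVAL_TIMEOUT_S)

    if decision == "approve":
        execute(diagnosis.proposed_action, audit_trail=diagnosis.trace_id)
    elif decision in ("timeout", "page_human"):
        # No answer, or the approver asked for a human: never let the agent decide here
        page_secondary_oncall(alert, diagnosis)
    else:
        log_denied(diagnosis, decision)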

Three details matter more than the rest of the code. The agent's tools are read-only at diagnosis time, so the worst case before approval is a wrong recommendation rather than a wrong action. The Slack message contains the full proposed action and the model version, so the approver knows exactly what they are signing off on. The timeout branch pages a human; it never falls back to autonomous execution. If you change one of those three, you don't have Shape C anymore — you have Shape A wearing a costume.
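
The loop above leaves wait_for_decision undefined. One possible in-process version, using only the standard library, is sketched here; register_request and record_decision are hypothetical helper names, and the Slack button handler is assumed to call record_decision when someone clicks.

```python
import queue

_pending = {}  # request_id -> queue holding at most one decision

def register_request(request_id):
    """Call this when the approval message is posted."""
    _pending[request_id] = queue.Queue(maxsize=1)

def record_decision(request_id, decision):
    """Called by the button handler. Only the first click counts."""
    q = _pending.get(request_id)
    if q is None:
        return False  # unknown request, or it already timed out
    try:
        q.put_nowait(decision)
        return True
    except queue.Full:
        return False  # someone else already decided

def wait_for_decision(request_id, timeout):
    """Block until a decision arrives or `timeout` seconds pass."""
    try:
        return _pending[request_id].get(timeout=timeout)
    except queue.Empty:
        return "timeout"  # the caller pages a human; it never auto-executes
    finally:
        _pending.pop(request_id, None)  # late clicks are rejected

register_request("req-1")
record_decision("req-1", "approve")
print(wait_for_decision("req-1", timeout=0.05))  # approve

register_request("req-2")
print(wait_for_decision("req-2", timeout=0.05))  # timeout
```

Removing the request once the wait ends matters for the audit trail: an approval that arrives after the timeout is refused rather than silently applied to an incident a human is already handling. This version keeps state in memory, so a restart loses pending requests; a production gate would need a durable store.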

What we've learned shipping these

Two things have surprised us across the engagements where we built the hybrid version.

The first is that the human approval rate stabilises around 92–96% after the runbook has been tightened for a month. Engineers who started the project skeptical end up approving the agent's first proposal nearly every time. That is the moment to double-check your gate. High approval rates are not a signal you should remove the human; they are a signal the human is now rubber-stamping, which is a different SOC 2 problem and a different post-mortem story.
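
A rough way to watch for rubber-stamping is to track how long approvers take alongside how often they approve. The sketch below assumes you can export each gate decision with its decision time; the 0.92 rate echoes the range above, while the 10-second review floor is a placeholder assumption you would tune yourself.

```python
from statistics import median

def rubber_stamp_warning(decisions, rate_threshold=0.92, min_median_review_s=10):
    """decisions: list of (decision, seconds_to_decide) tuples, timeouts excluded.

    Returns True when approvals are both very frequent and very fast.
    """
    if not decisions:
        return False
    approval_times = [secs for decision, secs in decisions if decision == "approve"]
    if not approval_times:
        return False
    rate = len(approval_times) / len(decisions)
    if rate < rate_threshold:
        return False
    return median(approval_times) < min_median_review_s

# Invented months of gate decisions: same 94% approval rate, different review times
hasty = [("approve", 4)] * 47 + [("deny", 95)] * 3
careful = [("approve", 45)] * 47 + [("deny", 95)] * 3

print(rubber_stamp_warning(hasty))    # True
print(rubber_stamp_warning(careful))  # False
```

Both samples have the same approval rate, so the rate alone cannot tell them apart. Review time is only a proxy for attention, but it is one you can show an auditor.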

The second is that the runbook itself becomes the artifact that matters. The agent forces you to write down what you actually do at 03:00, and that document — not the agent — ends up being the most valuable output of the project. Founders who shipped a rotation alongside the runbook saw their human MTTA improve too, because the rotation was now reading from the same source the agent was.

When we built the on-call hybrid for a Rotterdam-based payments SaaS earlier this year, the thing we ran into was that their alert pipeline had no concept of "class" — every alert had the same severity, which made the weighted MTTA exercise impossible until we added a classifier in front of it. We ended up solving it by routing every alert through a small AI agent that labels and weights before anything else fires.

The five-minute thing to do today

Open your incident tracker, take the last 60 alerts, and label each one with a per-minute euro cost of being unacknowledged. If you can't, your problem isn't on-call shape — it's that you don't yet know what your alerts are worth. Fix that first, then come back to this post.

Key takeaway

For sub-€7M SaaS, the on-call decision rests on three axes (weighted MTTA, SOC 2 defensibility, and holiday-loop ownership), not on a single ideological pick.

FAQ

Does SOC 2 allow an AI agent to take incident-response actions?

The Trust Services Criteria don't prohibit it. CC7.3 and CC7.4 care about a defined response process and a complete trail. Auditors are conservative, so the hybrid model (agent diagnoses, human approves) is the safest read of the criteria today.

What weighted MTTA cost should a small SaaS aim for?

Under €100 per incident is healthy for a sub-€7M SaaS. Over €400 means either your alert weights are wrong or your on-call shape is. Either way, fix the weights first; the shape decision depends on them.

What happens if the agent loops during a public holiday?

In Shape A it keeps acting until someone notices, which is the worst failure mode. In Shape C the four-minute approval timeout pages a secondary human. Never let the no-approval branch fall back to the agent's own judgment.

How much does a three-person on-call rotation actually cost in the Netherlands?

Roughly €18k/year in on-call premium for fair compensation, plus the harder-to-measure cost of two engineers sleeping with their phone face-up. The agent path trades some of that for engineering and audit work upfront.
