Make and Zapier audit: secrets, retries, silent fails

Forty-seven scenarios. Half in error for nine weeks. One had sent the same dunning email 412 times. This is the audit we run before we take over.

Jacob Molkenboer · Founder · A Brand New Company · 5 Jun 2026 · 7 min

The client emailed last week with the line we hear every quarter: "Can you take over our Zapier account? Our automation person left in March, nothing's been touched since." We logged in. Forty-seven scenarios. Half of them sitting in error states going back nine weeks. One had quietly sent the same dunning email to the same customer 412 times before someone in finance noticed.

We don't agree to maintain a Zapier or Make account on sight. We audit it first. The audit takes about three hours per hundred scenarios, and the answer is sometimes "we can't maintain this, it needs a rebuild." Three categories of damage show up over and over: rotting secrets, retry storms, and silent failures nobody sees until they surface in a customer support ticket.

Here is the checklist we run.

Secrets and connections

Open the connections panel first. Both Make and Zapier list every OAuth grant and API key the account holds. Most accounts we inherit have somewhere between fifteen and forty. Three things we look for.

Stale OAuth grants. Connections tied to ex-employees' Google or Microsoft accounts. The scenario runs fine until that employee is deprovisioned, then it fails silently (more on that below). Anything tied to a personal Gmail or a former staff member gets flagged immediately.

API keys with no rotation history. A static Stripe key, a Salesforce token, an SMTP password. If it has been rotated zero times since 2023, it is a liability. Check whether the source system supports rotation without downtime. Stripe and Slack do. Some older SaaS tools require you to delete and reauthorize, which means scheduled downtime.

Overscoped permissions. A scenario that only reads customer rows should not be holding a full-admin Salesforce token. We don't always fix this on day one, but we note it and tighten when we rebuild.
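
The three checks above can be sketched as a triage pass over a connection inventory. This is a minimal sketch, not an export format from either platform: the record shape (`owner`, `lastRotated`, `scopes`, `neededScopes`) and the policy fields are assumptions, and in practice we fill the inventory in by hand from the connections panel.

```javascript
// Hypothetical inventory rows, assembled by hand from the connections panel.
// Field names are illustrative, not a Make or Zapier export format.
function triageConnection(conn, { activeStaff, companyDomain, maxAgeDays, now }) {
  const flags = []

  // Stale grant: the owner is a personal address or is no longer on staff.
  const domain = conn.owner.split('@')[1]
  if (domain !== companyDomain || !activeStaff.has(conn.owner)) {
    flags.push('stale-owner')
  }

  // No rotation history, or the last rotation is older than the allowed age.
  const ageDays = conn.lastRotated
    ? (now - new Date(conn.lastRotated)) / 86400000
    : Infinity
  if (ageDays > maxAgeDays) flags.push('needs-rotation')

  // Overscoped: the connection holds scopes the scenario never uses.
  const extra = conn.scopes.filter((s) => !conn.neededScopes.includes(s))
  if (extra.length > 0) flags.push('overscoped')

  return flags
}

const policy = {
  activeStaff: new Set(['ops@example.com']),
  companyDomain: 'example.com',
  maxAgeDays: 365,
  now: new Date('2026-06-05'),
}

const flags = triageConnection(
  { owner: 'someone@gmail.com', lastRotated: null, scopes: ['read', 'admin'], neededScopes: ['read'] },
  policy
)
console.log(flags) // [ 'stale-owner', 'needs-rotation', 'overscoped' ]
```

The 365-day threshold is a placeholder. Pick whatever cadence the client's security policy actually requires.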

Warning

The fastest way to inherit a security incident is to take over an automation account without rotating credentials. Anyone who ever had access still has it. We make rotation a precondition of maintenance, not a follow-up.

Retry behavior and the cost of a stuck queue

Make and Zapier both retry failed steps. The defaults are sensible for most steps. They are dangerous for two specific shapes of scenario.

Outbound email and SMS. If a scenario sends a notification and the downstream system (Postmark, Twilio, your own SMTP) returns a 500, the platform will retry. If your scenario doesn't deduplicate on the retry, you send the message twice. If the upstream trigger fires fast enough and the downstream API is genuinely down for an hour, you can send the same message hundreds of times before the queue catches up.

We saw this exactly once with a Make scenario that watched a Google Sheet for new invoice rows and pushed them into a payment-chase email. Sheet appended a row, SMTP timed out, Make retried. Another row landed in the meantime. Each retry now processed both rows. The next retry processed three. By the time the team noticed, one customer had received forty-two identical dunning emails and replied to one of them with a screenshot of the inbox.
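
A toy model of that loop makes the amplification visible. It is a simulation only, under two assumptions that are ours: each attempt emails every row the scenario still considers unsent, and the SMTP acknowledgement times out so no row is ever marked as done. The `dedupe` flag stands in for any guard that remembers which rows were already emailed.

```javascript
// Toy model: each attempt emails every pending row, but the SMTP
// acknowledgement times out, so the run is marked failed and retried.
function simulateRetries({ attempts, dedupe }) {
  const pending = []            // rows the scenario still thinks are unsent
  const alreadySent = new Set() // what a dedup guard would remember
  const emailsPerRow = new Map()

  for (let attempt = 1; attempt <= attempts; attempt++) {
    pending.push(`row-${attempt}`) // a new invoice row lands before each attempt

    for (const row of pending) {
      if (dedupe && alreadySent.has(row)) continue
      emailsPerRow.set(row, (emailsPerRow.get(row) ?? 0) + 1)
      alreadySent.add(row)
    }
    // Because of the timeout, nothing is ever removed from `pending`.
  }
  return Object.fromEntries(emailsPerRow)
}

console.log(simulateRetries({ attempts: 3, dedupe: false }))
// { 'row-1': 3, 'row-2': 2, 'row-3': 1 }
console.log(simulateRetries({ attempts: 3, dedupe: true }))
// { 'row-1': 1, 'row-2': 1, 'row-3': 1 }
```

Without the guard, the oldest row is emailed once per attempt, so the customer on that row takes the full weight of the storm.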

Webhook fan-outs. A scenario triggered by a webhook that calls five downstream APIs. If step three fails and the run is started again from the top, whether by a manual rerun or by the sender redelivering the webhook, steps one and two execute twice. If either of them is not idempotent (creating a record, posting to Slack, charging a card), you have now created duplicates.

The audit fix is to inventory every scenario that writes to an external system and ask: if this step ran twice, would it cause damage? If yes, the scenario needs an idempotency key, a dedup lookup, or a queue with a guard. Often the cheapest fix is to push the actual write into a small webhook endpoint we control, where we can enforce idempotency in code.

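// Tiny idempotency guard in a Vercel-style serverless function
// that Make or Zapier calls instead of writing directly.
import { kv } from '@vercel/kv'

export default async function handler(req, res) {
  const key = req.headers['x-scenario-key']
  if (!key) return res.status(400).json({ error: 'missing key' })

  // First writer wins; everyone else gets a 200 with no side effect.
  // SET with NX returns null when the key already exists.
  const ok = await kv.set(`run:${key}`, 1, { nx: true, ex: 86400 })
  if (!ok) return res.status(200).json({ deduped: true })

  try {
    // doTheActualWork is a placeholder for the real write.
    await doTheActualWork(req.body)
  } catch (err) {
    // Release the key so a later retry can do the work
    // instead of being silently deduped.
    await kv.del(`run:${key}`)
    return res.status(500).json({ error: 'write failed' })
  }
  return res.status(200).json({ ok: true })
}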

The platform passes a stable key per logical run, we refuse duplicates for 24 hours, and a Make retry storm becomes one write plus a stack of harmless 200s.
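
What counts as a "stable key per logical run" deserves a concrete shape. One way to build it, sketched here with hypothetical field names (`invoiceId`, `template`), is to derive the key from the business identifiers of the action rather than from anything that changes between attempts, such as a timestamp or a platform run ID.

```javascript
// Build a stable idempotency key from fields that identify the logical action.
// The scenario name and field names are illustrative, not platform fields.
function scenarioKey(scenarioId, fields) {
  const parts = Object.keys(fields)
    .sort() // field order must not change the key
    .map((k) => `${k}=${fields[k]}`)
  return [scenarioId, ...parts].map((p) => encodeURIComponent(p)).join(':')
}

console.log(scenarioKey('dunning', { invoiceId: 'INV-1042', template: 'reminder-1' }))
// dunning:invoiceId%3DINV-1042:template%3Dreminder-1
```

The scenario would send this value in the `x-scenario-key` header. Two attempts for the same invoice and template produce the same key, so the second one is refused.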

The silent-fail trap

This is the one that does the most damage and is the hardest to spot.

Both Make and Zapier offer "stop on error" and "continue on error" branches. The convenient choice, the one most no-code builders pick, is to set the scenario to email the builder if anything fails. That works fine until:

The builder leaves the company.
The notification goes to a shared inbox nobody reads.
The error notification itself depends on the broken connection (a Slack channel that was deleted, a Gmail account that was deprovisioned).
The scenario is set to "continue on error" and the failed step is the one that actually mattered (the invoice posted, but the matching ledger entry didn't).

Half the scenarios we inherit are silently failing on one of these shapes. The customer never knows because the dashboard shows "Success" for the parent scenario and the buried error sits in the run history. Make's run history goes back thirty days on most plans. Zapier's task history depends on tier. After that, the evidence is gone.

Our audit step here is simple: pull the last thirty days of run history for every scenario and count the actual success rate by counting downstream side effects, not the platform's success flag. If a scenario is supposed to create a Salesforce record on every run, we run the count in Salesforce, not in Zapier. The two numbers almost never match.
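
That comparison can be sketched as a small reconciliation function. The inputs are assumptions: `runs` is whatever you can export from the platform's run history, with a hypothetical `expectedRecordId` per run, and `downstreamIds` is the result of a query against the target system.

```javascript
// Compare what the platform reports with what actually exists downstream.
// `runs` and `downstreamIds` are hypothetical exports: one from the platform's
// run history, one from a query against the target system (e.g. Salesforce).
function reconcile(runs, downstreamIds) {
  const landed = new Set(downstreamIds)
  const reportedOk = runs.filter((r) => r.status === 'success')
  const actuallyOk = runs.filter((r) => landed.has(r.expectedRecordId))
  const silent = reportedOk.filter((r) => !landed.has(r.expectedRecordId))
  const rate = (n) => (runs.length ? n / runs.length : 0)

  return {
    totalRuns: runs.length,
    reportedSuccessRate: rate(reportedOk.length),
    actualSuccessRate: rate(actuallyOk.length),
    silentFailures: silent.map((r) => r.runId),
  }
}

const report = reconcile(
  [
    { runId: 'a', status: 'success', expectedRecordId: 'rec-1' },
    { runId: 'b', status: 'success', expectedRecordId: 'rec-2' },
    { runId: 'c', status: 'success', expectedRecordId: 'rec-3' },
    { runId: 'd', status: 'error', expectedRecordId: 'rec-4' },
  ],
  ['rec-1', 'rec-3']
)
console.log(report.reportedSuccessRate, report.actualSuccessRate, report.silentFailures)
// 0.75 0.5 [ 'b' ]
```

The `silentFailures` list is the useful output: those are the runs to open in the run history before the retention window closes.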

Takeaway

The platform's "success" flag means the workflow finished without throwing. It does not mean the work actually got done.

What we do with the audit

We hand the client a one-page report per category, with a recommended action for every scenario: keep, rotate, rebuild, retire. The "retire" column is usually the most surprising one. About a third of scenarios in any account we audit are running for no living reason. The trigger system was replaced two years ago. The Slack channel was archived. The recipient left. Nobody turned the scenario off because nobody knew it existed.

The other thing the audit gives the client is a defensible baseline. If you don't know what's running today, you can't tell tomorrow whether a regression came from your change or from the scenario rotting on its own.

For the scenarios that survive the audit, we wire up a real error pipeline. Every scenario reports failures to a single channel we own. Every connection has a documented rotation cadence. Every write step is idempotent or guarded. The Slack alerts include a run link and the last 200 characters of the error, so the on-call person can triage without opening the platform UI.
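
The alert format described above is small enough to sketch. The input field names (`scenario`, `runUrl`, `error`) are assumptions; the returned object uses the `{ text }` shape that a Slack incoming webhook accepts.

```javascript
// Build the alert text: scenario name, run link, and the tail of the error.
// Input field names are illustrative; the 200-character limit follows the text.
function buildAlert({ scenario, runUrl, error }) {
  const message = String(error?.message ?? error)
  const tail = message.length > 200 ? '…' + message.slice(-200) : message
  return { text: `:red_circle: ${scenario} failed\n${runUrl}\n${tail}` }
}

console.log(
  buildAlert({
    scenario: 'invoice-to-ledger',
    runUrl: 'https://example.com/runs/123',
    error: new Error('401 Unauthorized'),
  }).text
)
```

We keep the tail of the error rather than the head because the last lines of a stack or API response usually name the actual cause. The payload would be posted to a webhook URL that lives only in the alerting service's own configuration.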

When we ran this audit for a Dutch B2B distributor with forty-six live Make scenarios, the result was thirteen kept, eleven rebuilt on a stricter pattern, and twenty-two retired. The dunning-loop scenario that had been double-sending for weeks is now a small Node service we own with a real idempotency table. That kind of process automation rebuild is roughly half our integrations work.

If you take one thing from this post: open your Make or Zapier account this afternoon, sort scenarios by last error date, and count how many have been red for more than thirty days. That number is the size of the problem you don't yet know you have.
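
If you have the run history in an export rather than in the UI, the same count can be sketched in a few lines. The input shape (`runs` with `at` and `ok`) is an assumption. A scenario counts as red from the first failure after its last success.

```javascript
// Timestamp of the first failure in the current failing streak, or null
// if the most recent run succeeded. `runs` is a hypothetical export shape.
function redSince(runs) {
  const sorted = [...runs].sort((a, b) => new Date(a.at) - new Date(b.at))
  let since = null
  for (const run of sorted) {
    if (run.ok) since = null
    else if (since === null) since = run.at
  }
  return since
}

// Count scenarios that have been failing continuously for more than `days`.
function countLongRed(scenarios, now, days = 30) {
  const cutoff = now.getTime() - days * 86400000
  return scenarios.filter((s) => {
    const since = redSince(s.runs)
    return since !== null && new Date(since).getTime() < cutoff
  }).length
}

const scenarios = [
  { name: 'A', runs: [{ at: '2026-03-01', ok: true }, { at: '2026-04-01', ok: false }, { at: '2026-06-01', ok: false }] },
  { name: 'B', runs: [{ at: '2026-04-01', ok: false }, { at: '2026-05-01', ok: true }, { at: '2026-06-01', ok: false }] },
  { name: 'C', runs: [{ at: '2026-06-01', ok: true }] },
]
console.log(countLongRed(scenarios, new Date('2026-06-05'))) // 1
```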

FAQ

How long does a Make or Zapier audit take?

About three hours per hundred scenarios for the initial pass. Deeper work on idempotency and rotation is separate, scoped per scenario based on what the audit flags.

Why rotate credentials before taking over an account?

Anyone who ever had access still has it. Stale OAuth grants tied to ex-employees and unrotated API keys are the most common way an inherited automation account becomes a security incident.

What is the silent-fail trap?

A scenario set to continue on error, or alerting through a connection that itself broke. The platform shows success while real work stops. You only find it when a customer complains weeks later.

Can we just keep using Make or Zapier after the audit?

Often yes. Some scenarios get rebuilt as small services where idempotency matters, but the no-code platform stays where the logic is simple and the cost of a duplicate run is zero.

How do you alert on failures without depending on the broken connection?

Alerts go through a separate channel we own, with its own credentials, that does nothing else. If a customer scenario breaks, the alert path is not in the same fault domain.
