Zendesk and Intercom audit: what we score before AI retrofit
Tuesday morning at 10:00. We're on a screen-share with a Dutch SME — 22 agents on Zendesk, Intercom on the widget — and they want an AI retrofit. First, the audit.

Tuesday, 10:00. Screen-share with the operations lead at a Dutch e-commerce brand, somewhere between €8M and €15M revenue. Zendesk Professional, 22 agents, Intercom for the website widget, Klaviyo plugged in for the post-purchase loop. They want to retrofit an AI agent on top to triage 60% of inbound. The deck the partner sent over says “two weeks live.” We say: first, the audit.
We run the same checklist on every sub-€20M Zendesk-plus-Intercom tenant before we quote. Not because we’re allergic to short engagements (we like short engagements), but because the price of a support-agent retrofit is set by the mess inside the tenant, not by the volume on the dashboard. Take two tenants that look identical on paper: one ships in three weeks, the other takes five months. The difference is what the audit finds.
Here is the checklist, in the order we run it.
Trigger-cascade ordering on the top 35 macros
Zendesk runs triggers in a fixed sequence on every ticket event. Macros are agent-applied templates, but the real work happens in the trigger chain they fire by setting tags, fields, and groups. On a typical SME tenant we find between 80 and 150 active triggers and between 200 and 400 macros. We score only the top 35 macros — the ones used more than five times per week per agent — because that is what the AI agent will replicate in production. The rest are inventory.
For each macro we walk the trigger chain it activates and check three things: does any subsequent trigger reset a field the macro just set; does a tag-reading trigger fire before the tag-setting trigger; does a notification trigger fan out to a Slack channel that was archived last August. Each macro is then scored clean, out-of-order, or silently broken.
Macro                     | Triggers fired                         | Order | Verdict
--------------------------|----------------------------------------|-------|-------------------------------
Refund request - eligible | set_refund_tag, route_to_finance       | OK    | clean
Track-and-trace           | set_tnt_tag, enrich_from_PostNL        | BAD   | out-of-order (read before set)
Wrong size - return label | set_return_tag, notify_warehouse_slack | OK    | clean, but Slack room dead
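The read-before-set check can be roughed out mechanically before the manual walk. The sketch below is an illustration, not a Zendesk export format: it assumes you have hand-built a tab-separated file with one trigger per line (position, trigger name, tags it reads, tags it sets), and the function name `check_trigger_order` is hypothetical. It also assumes a single top-to-bottom pass and does not model any re-evaluation the platform performs after a trigger fires, so treat a hit as a candidate for review rather than a verdict.

```shell
# Flag triggers that read a tag before the trigger that sets it has run.
# Input (hypothetical TSV, built by hand): position <TAB> trigger <TAB> tags_read <TAB> tags_set
# Tag lists are comma-separated; use "-" for an empty list.
check_trigger_order() {
  sort -t "$(printf '\t')" -k1,1n "$1" | awk -F '\t' '
    {
      # remember each trigger in execution order
      name[NR] = $2; reads[NR] = $3
      # record the first row that sets each tag
      n = split($4, s, ",")
      for (i = 1; i <= n; i++)
        if (s[i] != "-" && !(s[i] in first_set)) first_set[s[i]] = NR
    }
    END {
      for (r = 1; r <= NR; r++) {
        n = split(reads[r], t, ",")
        for (i = 1; i <= n; i++)
          if ((t[i] in first_set) && first_set[t[i]] > r)
            printf "out-of-order: %s reads %s before %s sets it\n", name[r], t[i], name[first_set[t[i]]]
      }
    }
  '
}

# Usage: check_trigger_order triggers.tsv
```

An empty result means no read-before-set pair was found in the file. It says nothing about field resets or dead notification targets, which still need the manual pass.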
The reason this matters for an AI retrofit: a trigger that silently fails today still appears to work for human agents, because the human compensates. They copy-paste the tracking number into the reply by hand. The AI agent inherits the trigger as the contract and ships the broken state into production. We have watched a “ready to go live” agent send 412 customers an empty tracking placeholder over the course of a Saturday morning before anyone noticed.
Conversation-tag drift across the last 90 days
Pull the CSV of every tag applied on a ticket in the last 90 days. Group by frequency, sort descending. A healthy tenant compresses the long tail: 80% of tickets touch 20 to 30 tags, and the rest are bounded categorical — country, channel, product line. A drifted tenant has 600-plus unique tags, 80% of them used once or twice, mostly typos (refund_requestd), dead campaign codes (q4_2023_blackweek_DE), and abandoned experiments (test_agent_routing_v2_FINAL_NEW).
Drift kills retrofits because the training set for “what does this ticket look like” becomes noise. The agent learns to predict the long tail, which is by definition not predictable. We measure three numbers per tenant:
- Tag entropy — Shannon entropy across the tag distribution. Below 4.5 bits is healthy. Above 6 means anything could happen on any ticket.
- Orphan rate — percentage of tags used fewer than three times in 90 days. Below 30% is healthy. We have seen 78%.
- Overlap — manually scored, by us. Three different tags that mean “return shipment from Belgium” reduce to one. We do this by hand because regex won’t catch the synonyms.
A small jq snippet on the tag export, if you want to score it yourself before the meeting:
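# count how often each tag was applied, most frequent first
jq -r '.[] | .tag' tags_90d.json \
  | sort | uniq -c | sort -nr \
  > tag_counts.txt
# orphan rate: share of unique tags applied fewer than three times
awk '$1 < 3 { o++ } { t++ } END { if (t) print o / t; else print 0 }' tag_counts.txt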
If the orphan rate prints above 0.5, the AI retrofit conversation should start with a tag-merge sprint, not with a model choice.
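The entropy number can be scored from the same file. The following is a minimal sketch, assuming `tag_counts.txt` holds one `<count> <tag>` pair per line as `uniq -c` writes it; the function name `tag_entropy` is ours, chosen for illustration. It computes H = -Σ p·log2(p), where p is each tag's share of all tag applications.

```shell
# Shannon entropy (in bits) of the tag distribution.
# Expects "<count> <tag>" per line, e.g. the tag_counts.txt produced by uniq -c.
tag_entropy() {
  awk '
    { count[NR] = $1; total += $1 }
    END {
      if (total == 0) { print "0.000"; exit }   # empty export: nothing to score
      for (i in count) {
        p = count[i] / total                    # share of all tag applications
        h -= p * log(p) / log(2)                # log base 2 gives bits
      }
      printf "%.3f\n", h
    }
  ' "$1"
}

# Usage: tag_entropy tag_counts.txt
```

As a sanity check, four tags applied equally often give exactly 2 bits, and 64 tags applied equally often give 6 bits, which is the threshold above which the text treats a tenant as drifted.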
Side-conversation API rotation and the AVG context-trail
This is the section that kills most quotes, and the one nobody flags in the discovery call.
Zendesk side conversations are how an agent loops in a warehouse, a supplier, a 3PL — without exposing the customer thread. They run over external transports: email, Microsoft Teams, Slack. Each transport authenticates with credentials stored in Zendesk’s integration layer. Under any reasonable security posture, those credentials rotate every 90 days. When they rotate, three things must not break.
- Webhook delivery for new replies on existing side conversations.
- Attribution metadata on each message — who said what, when, on which channel.
- The export endpoint that serves an AVG / GDPR Article 15 data-subject request, which legally must include side-conversation content where the customer’s data appears.
We test this by rotating the credential in a staging tenant, then immediately exporting a ticket that has an active side conversation and checking the JSON for completeness. Of the last ten Dutch SME tenants we audited under this protocol, three survived the rotation cleanly. Six retained the message bodies but lost attribution — the side-conversation messages came back as “external participant,” date stamps intact, but the named sender wiped. One tenant lost the messages entirely from the export, because the integration token was scoped only to that single conversation thread; rotating revoked retroactive access.
If side conversations lose attribution on credential rotation, the AVG Article 15 export is incomplete and the processing register under Article 30 doesn’t match reality. The Autoriteit Persoonsgegevens treats both as findings in an audit.
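The completeness check on the exported JSON can be scripted so the before-rotation and after-rotation results are comparable. This is a sketch under loud assumptions: the export shape below (`side_conversations`, `messages`, `body`, `sender`) is hypothetical and is not the documented Zendesk schema, so the paths need adapting to whatever your export actually contains. The function name `check_export` is likewise illustrative. It counts messages, messages with an empty body, and messages whose sender is empty or came back as “external participant”.

```shell
# Hypothetical export shape (an assumption, not the documented Zendesk schema):
# {"side_conversations":[{"id":"sc1","messages":[{"body":"..","sender":"..","created_at":".."}]}]}
check_export() {
  jq -r '
    # flatten every side-conversation message into one array
    [.side_conversations[]?.messages[]?] as $m
    # bodies that are missing or empty
    | ($m | map(select((.body // "") == "")) | length) as $no_body
    # senders that are missing, empty, or anonymised
    | ($m | map(select((.sender // "") == "" or .sender == "external participant")) | length) as $no_sender
    | "messages=\($m | length) missing_body=\($no_body) missing_sender=\($no_sender)"
  ' "$1"
}

# Usage: run once before the rotation and once after, then compare the two lines.
# check_export ticket_export.json
```

If `messages` drops after the rotation, content was lost. If `missing_sender` rises while `messages` holds steady, the bodies survived but attribution did not, which matches the failure mode seen in six of the ten tenants.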
The retrofit risk: an AI agent that fans out to side conversations on the customer’s behalf inherits this fragility. If the agent’s reasoning trail depends on side-conversation context — “I asked the warehouse, they said yes, therefore I confirmed to the customer” — and that trail vanishes on the next key rotation, you have a black-box decision in production that you cannot reconstruct. That is not an AVG problem in theory. It is an AVG problem the first time a customer asks why.
The scoring sheet
We compress the audit into one page, five categories, each scored 0 to 3. We won’t quote a retrofit on a tenant scoring below 7 out of 15. Below 7, we quote a cleanup first and revisit the retrofit after.
Category                         | Score 0-3 | Notes
---------------------------------|-----------|----------------------------------------------
Trigger health (top 35 macros)   |           | clean / out-of-order / broken
Macro coverage vs ticket reasons |           | does the library match what actually comes in
Tag entropy and orphan rate      |           | entropy < 4.5, orphan < 30%
Side-conversation rotation       |           | survives the staging test
AVG context-trail integrity      |           | export reproduces the thread
---------------------------------|-----------|----------------------------------------------
Total                            | /15       | <7 = cleanup before retrofit
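The gate itself is simple arithmetic and can be written down as a small helper. This is a sketch, not our internal tooling: the function name `audit_verdict` and the argument order are assumptions made for the example. It applies the two rules stated in this article: a total below 7 means cleanup first, and so does a zero on side-conversation rotation (the passing condition given in the FAQ).

```shell
# Args, each scored 0-3, in sheet order:
#   trigger_health macro_coverage tag_health side_conv_rotation avg_trail
audit_verdict() {
  if [ "$#" -ne 5 ]; then
    echo "usage: audit_verdict T M G S A (five scores, 0-3 each)" >&2
    return 2
  fi
  total=0
  for s in "$@"; do
    case "$s" in
      0|1|2|3) total=$((total + s)) ;;
      *) echo "score must be 0-3: $s" >&2; return 2 ;;
    esac
  done
  # below 7/15, or a zero on side-conversation rotation, means cleanup first
  if [ "$total" -lt 7 ] || [ "$4" -eq 0 ]; then
    echo "$total/15 cleanup before retrofit"
  else
    echo "$total/15 quote retrofit"
  fi
}

# Usage: audit_verdict 2 2 1 1 1
```

Note that a tenant can score 12 out of 15 and still land on cleanup if the rotation test scores zero, which is the point of treating that category as a hard gate.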
The score that matters most is the last category, AVG context-trail integrity. We have shipped retrofits on tenants with messy macros: humans wrote those macros, and humans can clean them in a sprint. We have walked away from tenants whose AVG trail couldn’t survive a credential rotation, because that is a structural problem with the integration topology and it is not the AI agent’s job to fix it. It is the platform team’s job, before anyone bolts an agent on top.
What the three survivors had in common
Of the three tenants from our last batch of ten that passed the side-conversation rotation cleanly, none had the most expensive Zendesk plan, none had a dedicated platform engineer, and none had recently re-platformed. What they had was boring: one named owner of the integration credentials, a 90-day rotation actually written down in a runbook, and side conversations restricted to two transports rather than five. Less surface, fewer secrets, one person who knows where the keys live. That is the entire pattern.
What to do today
Open Zendesk, export the last 90 days of tag activity, run the jq snippet above, and read the orphan rate out loud. If it is over 50%, the AI retrofit conversation should start with a tag cleanup, not a model choice. When we built the support agent for a Dutch homeware brand last winter, the thing that nearly derailed the project was exactly this — 612 unique tags in 90 days, no taxonomy, no owner. We ended up solving it with a one-week tag-merge sprint before any AI agent work began, and the retrofit shipped in the original three-week window.
Export the tags. Read the orphan rate. The audit starts there.
Key takeaway
The retrofit price is set by the mess in the tenant, not the ticket volume. Score trigger order, tag drift, and side-conversation safety before you quote.
FAQ
What does a passing score on the audit look like?
Seven or higher out of fifteen, with a non-zero on side-conversation rotation. Anything lower means a tag or integration cleanup before we quote the retrofit.
Why score only the top 35 macros and not all of them?
Because the AI agent models actual usage, not the macro library. Macros used fewer than five times per week per agent rarely make it into the agent's behavior surface.
How long does the full audit take?
One working day for a sub-€20M tenant with a single brand. Two days if multiple brands share the Zendesk instance with overlapping macros and tag namespaces.
Does this audit apply to Intercom-only tenants?
Most of it. Trigger ordering becomes workflow ordering, side-conversations become Intercom's collaborator inbox. Tag drift and AVG trail tests are identical.