Retiring SAP PI 7.4: the EDI cutover we ran in nine weeks

An SAP PI 7.4 estate, 14 AS2 and X.400 partners, one EDI coordinator who has not slept properly in months. Here is the cutover plan we actually shipped.

Jacob Molkenboer · Founder · A Brand New Company · 11 Jun 2026 · 9 min

It is a Tuesday morning in February. Marleen, the EDI coordinator at a Dutch logistics integrator running 14 partner flows, forwards us an SAP notice. Extended maintenance for PI 7.4 ends inside their fiscal year. Three of their largest partners now contractually require AS2 MDNs within ten seconds. The current middleware misses that bar on a quiet day, and on Black Friday it misses by minutes.

By April we had retired the SAP PI cluster and were routing every AS2 and X.400 flow through Temporal workflows behind a Cloudflare Workers ingress. The cutover took nine weeks, two of them in full shadow mode with both systems processing every message, and most of the work was not what we expected at week zero. This is the playbook, in the order of operations that mattered.

The estate we walked into

One physical SAP PI 7.4 cluster on RHEL, two nodes, Oracle 19c underneath. Java mappings written between 2013 and 2019, mostly by people who had left. Fourteen partner channels: nine AS2 over HTTPS, four X.400 routed via an OpenText BizManager gateway, one private FTPS endpoint that nobody could fully explain. Average inbound volume around 8,000 messages a day, peak 40,000 around quarter-end.

The "documentation" was a 47-tab Excel file dated 2017, an AS2 partner matrix in Confluence that existed in three conflicting versions, and tribal knowledge in the heads of two people, one of whom had announced retirement.

The brief: keep every existing partner relationship alive without forcing certificate renegotiations, drop SAP licensing, hit the ten-second MDN clock, and finish before fiscal year-end. The implied subtext: do not, under any circumstance, lose a delivery note.

The case against buying another middleware

The honest answer is that we did consider it. SAP Integration Suite, MuleSoft, Boomi, and an OpenText replatform each got a one-page weighing exercise. Three things tipped us toward a Temporal plus Cloudflare Workers fabric:

EDI flows are long-running and intermittently flaky. Temporal's durable execution model maps to that reality without us reinventing retry, timeout, and history. The workflow primitives already encode the semantics we would otherwise rebuild.
AS2 is fundamentally HTTPS plus signed payloads and MDN receipts, defined in RFC 4130. Cloudflare Workers terminate TLS at the edge, give us per-partner client certificates through a single config object, and never run a JVM that needs patching at 2am.
The client's ops team is two people. They needed to read the runbook on a Sunday afternoon and not be afraid. JavaScript and TypeScript were already in-house. An ESB DSL was not.

X.400 was the tricky one. We kept the OpenText gateway in place and treated it as an MTA, with a thin shim that converted inbound X.400 into AS2-shaped envelopes before they hit Temporal. Replacing X.400 termination is a separate project that nobody had budget for, and that was fine.

The nine-week shape

We split the calendar into five phases. Weeks 1 to 2 were observation. Weeks 3 to 4 were build. Weeks 5 to 6 were shadow mode. Weeks 7 to 8 were partner-by-partner cutover. Week 9 was decommissioning and handover. No phase ran the same team end to end. Observation needed a senior integration engineer pairing with Marleen. Build needed two engineers heads-down. Shadow needed someone watching dashboards. Cutover needed the same engineer who built the partner channel in question.

Weeks 1 and 2: mapping the truth on the wire

Before we wrote any new code, we tapped the existing PI cluster. Every inbound AS2 request and every outbound MDN was mirrored to an S3 bucket, raw bytes plus headers, for two full weeks. By the end of week two we had 110,000 real messages on disk: anonymised, hashed by partner, replayable.

This step is the one nobody budgets and everybody needs. The Excel mapping was wrong in 11 places. Two partners were sending content-types the documentation said were impossible. One partner was reusing the same Message-ID on retries, which the PI cluster had been handling by accident.

Warning

Never start an EDI migration from documentation alone. Capture two weeks of real traffic before you decide what your new system has to accept. The wire is the contract. The wiki is the wish.
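A minimal sketch of the kind of pass over the capture that surfaces those surprises: the distinct content types each partner really sends, and any Message-ID that arrives more than once. The `CapturedMessage` shape and the `summariseCapture` name are assumptions for illustration, not the tooling used on the project.

```typescript
// Hypothetical record for one captured inbound message (field names are assumptions).
interface CapturedMessage {
  partnerHash: string; // partner identifier after hashing
  messageId: string;   // value of the AS2 Message-ID header
  contentType: string; // raw Content-Type header value
}

interface PartnerFindings {
  contentTypes: string[];     // distinct media types seen on the wire
  reusedMessageIds: string[]; // Message-IDs that arrived more than once
}

function summariseCapture(messages: CapturedMessage[]): Map<string, PartnerFindings> {
  const types = new Map<string, Set<string>>();
  const idCounts = new Map<string, Map<string, number>>();

  for (const m of messages) {
    let seen = types.get(m.partnerHash);
    let counts = idCounts.get(m.partnerHash);
    if (!seen || !counts) {
      seen = new Set<string>();
      counts = new Map<string, number>();
      types.set(m.partnerHash, seen);
      idCounts.set(m.partnerHash, counts);
    }
    // Keep only the media type, dropping parameters such as boundary or charset.
    const mediaType = (m.contentType.split(';')[0] ?? '').trim().toLowerCase();
    seen.add(mediaType);
    counts.set(m.messageId, (counts.get(m.messageId) ?? 0) + 1);
  }

  const findings = new Map<string, PartnerFindings>();
  types.forEach((seen, partner) => {
    const counts = idCounts.get(partner) ?? new Map<string, number>();
    const reusedMessageIds = Array.from(counts.entries())
      .filter(([, n]) => n > 1)
      .map(([id]) => id);
    findings.set(partner, { contentTypes: Array.from(seen).sort(), reusedMessageIds });
  });
  return findings;
}
```

Comparing each partner's `contentTypes` against the documented matrix is one way to find the "impossible" content types before they show up in shadow mode.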

Weeks 3 and 4: the Workers ingress and the Temporal core

Architecture was deliberately boring. A Cloudflare Worker per partner-facing hostname terminated TLS and authenticated the partner's client certificate. The Worker did three things and stopped: validate that the request looked like AS2, drop the raw body into R2, and start a Temporal workflow.

export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    // AS2 messages arrive as HTTP POSTs; reject everything else early.
    if (req.method !== 'POST') return new Response('method not allowed', { status: 405 });

    // Map the authenticated client certificate to a known partner.
    const partnerId = await resolvePartner(req, env);
    if (!partnerId) return new Response('unknown partner', { status: 401 });

    // Store the raw bytes untouched so every message can be replayed from R2.
    const body = await req.arrayBuffer();
    const messageKey = `inbound/${partnerId}/${crypto.randomUUID()}`;
    await env.R2.put(messageKey, body, {
      httpMetadata: { contentType: req.headers.get('content-type') ?? '' }
    });

    // Hand over to Temporal with a pointer to the body, not the body itself.
    const headers = Object.fromEntries(req.headers);
    const handle = await env.TEMPORAL.startWorkflow('inboundAs2', {
      taskQueue: 'edi-as2',
      args: [{ partnerId, messageKey, headers }]
    });

    return new Response(`accepted ${handle.workflowId}`, { status: 202 });
  }
};
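The snippet above does not show the "looked like AS2" check itself. A minimal sketch of what such a check could be, assuming it only confirms that the headers RFC 4130 builds on are present and non-empty. The `looksLikeAs2` helper and the exact choice of mandatory headers are assumptions for illustration; it expects the lower-cased header map that `Object.fromEntries(req.headers)` produces.

```typescript
// Headers this sketch treats as mandatory for an inbound AS2 message.
// AS2-Version is left out on purpose: older senders may omit it.
const REQUIRED_AS2_HEADERS = ['as2-from', 'as2-to', 'message-id'];

interface As2Check {
  ok: boolean;
  missing: string[]; // names of required headers that were absent or blank
}

// Hypothetical helper: a cheap structural check, no signature or payload parsing.
function looksLikeAs2(headers: Record<string, string>): As2Check {
  const missing = REQUIRED_AS2_HEADERS.filter(
    (name) => (headers[name] ?? '').trim() === ''
  );
  return { ok: missing.length === 0, missing };
}
```

Keeping the check this shallow fits the thin-ingress idea: signature verification and payload decoding stay in Temporal activities, where a failure can be replayed from the stored body.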

The Temporal workflow itself is similarly small. It decodes, dispatches to the back-office (SAP S/4HANA, in this client's case), and signs the MDN. Activities own the messy parts: certificate lookups, schema validation, the IDoc translation.

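import { proxyActivities } from '@temporalio/workflow';
import type * as activities from './activities';

// Shape inferred from the args the Worker passes when it starts the workflow.
interface InboundInput {
  partnerId: string;
  messageKey: string;
  headers: Record<string, string>;
}

// Activities run outside the workflow sandbox; Temporal retries them with backoff.
const { decode, dispatchToBackoffice, signAndSendMdn } = proxyActivities<typeof activities>({
  startToCloseTimeout: '90 seconds',
  retry: { maximumAttempts: 6, backoffCoefficient: 2 }
});

export async function inboundAs2(input: InboundInput) {
  const message = await decode(input);   // parse and validate the stored payload
  await dispatchToBackoffice(message);   // hand off to the back-office
  return signAndSendMdn(message);        // sign and return the receipt
}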

The whole workflow file fit on a phone screen, and that mattered. The PI Java mapping it replaced was 380 lines and called four UDFs. When something broke at 3am, Marleen needed to be able to read the code, not decode it.
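To make the retry policy above concrete, here is a small worked calculation of the waits between attempts. It assumes Temporal's default initial retry interval of one second, which the snippet does not override, and it ignores the maximum-interval cap because these values never reach it. The `retryWaits` helper is illustrative only.

```typescript
interface RetryPolicy {
  initialIntervalSeconds: number; // assumption: Temporal's default of 1 second
  backoffCoefficient: number;
  maximumAttempts: number;
}

// Wait before retry n (1-based) is initial * coefficient^(n - 1).
// With maximumAttempts attempts there are maximumAttempts - 1 waits.
function retryWaits(p: RetryPolicy): number[] {
  const waits: number[] = [];
  for (let retry = 0; retry < p.maximumAttempts - 1; retry++) {
    waits.push(p.initialIntervalSeconds * Math.pow(p.backoffCoefficient, retry));
  }
  return waits;
}

const waits = retryWaits({ initialIntervalSeconds: 1, backoffCoefficient: 2, maximumAttempts: 6 });
const totalWait = waits.reduce((sum, w) => sum + w, 0);
console.log(waits, totalWait); // [ 1, 2, 4, 8, 16 ] 31
```

Under these assumptions the cumulative wait passes ten seconds on the fourth retry (1 + 2 + 4 + 8 = 15), before counting any execution time, which is worth keeping in mind next to a ten-second MDN requirement.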

Weeks 5 and 6: shadow mode and reconciliation

For two weeks both systems processed every message. The Worker accepted the inbound, kicked off a Temporal workflow, and at the same time replayed the request into the existing PI cluster. The MDN that went back to the partner came from PI, untouched. The new fabric did everything in parallel except respond.

The reconciliation logic was a Postgres table and one ugly query. Every workflow wrote an edi_shadow_runs row with its decoded fields, the back-office ack, and the proposed MDN. A scheduled job pulled the matching PI run from the legacy audit log and diffed them.

-- Per-partner reconciliation over the last 24 hours of shadow runs.
select partner_id,
       count(*) filter (where new_outcome = 'mdn_signed')              as new_ok,
       count(*) filter (where pi_outcome  = 'MDNSent')                 as pi_ok,
       count(*) filter (where new_outcome = 'mdn_signed'
                          and pi_outcome  = 'MDNSent'
                          and new_doc_hash = pi_doc_hash)              as both_ok,
       -- "is distinct from" also counts rows where one hash is null,
       -- which a plain <> comparison would silently skip.
       count(*) filter (where new_outcome = 'mdn_signed'
                          and pi_outcome  = 'MDNSent'
                          and new_doc_hash is distinct from pi_doc_hash) as diverged
  from edi_shadow_runs
 where window_start >= now() - interval '24 hours'
 group by partner_id
 order by partner_id;

We held two rules. First: any partner with non-zero diverged rows over a 24-hour window blocked cutover for that partner. Second: any partner whose new_ok was below 99.95% of pi_ok blocked cutover. Five partners passed both rules at the end of week five. Three needed schema fixes in our decode activity. Two of the X.400 partners exposed timing issues with the OpenText shim that took an extra week to settle.
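The two rules translate directly into a gate over one row of the query's output. A minimal sketch, assuming the row is passed in as a plain object; the `PartnerShadowStats` type and `cutoverBlockers` function are illustrative names, not the project's code.

```typescript
// One row of the reconciliation query, for a single partner and 24-hour window.
interface PartnerShadowStats {
  partnerId: string;
  newOk: number;    // new_ok
  piOk: number;     // pi_ok
  diverged: number; // diverged
}

const MIN_OK_RATIO = 0.9995; // rule two: new_ok must be at least 99.95% of pi_ok

// Returns the reasons a partner is blocked; an empty list means cutover may proceed.
function cutoverBlockers(s: PartnerShadowStats): string[] {
  const blockers: string[] = [];
  if (s.diverged > 0) {
    blockers.push(`${s.diverged} diverged rows in the window`);
  }
  if (s.newOk < MIN_OK_RATIO * s.piOk) {
    blockers.push('new_ok below 99.95% of pi_ok');
  }
  return blockers;
}
```

Returning reasons rather than a boolean keeps the gate readable on a dashboard: a blocked partner shows which rule it failed.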

Weeks 7 and 8: partner-by-partner cutover

We flipped one partner per business day. The Worker config moved that partner from "shadow PI" to "respond from Temporal." The PI channel was left running but stopped sending MDNs. We watched for 48 hours before flipping the next partner. Two partners required a tiny notice to their ops team that a new IP address was about to appear on their AS2 ACL. None required a certificate renegotiation, which had been the explicit design goal.
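A sketch of what a per-partner flip at the config layer could look like, assuming the config is a simple map from partner id to mode. The mode names, partner ids, and the `mdnSource` helper are hypothetical; the real Worker config is not shown in this article.

```typescript
type PartnerMode = 'shadow-pi' | 'respond-from-temporal';

// Hypothetical per-partner config object; the partner ids are placeholders.
const partnerModes: Record<string, PartnerMode> = {
  'partner-a': 'respond-from-temporal',
  'partner-b': 'shadow-pi',
};

// Decides which system's MDN goes back to the partner. Anything not explicitly
// flipped stays on PI, so a missing entry can never cut a partner over by accident.
function mdnSource(partnerId: string, modes: Record<string, PartnerMode>): 'pi' | 'temporal' {
  const known = Object.prototype.hasOwnProperty.call(modes, partnerId);
  const mode: PartnerMode = known ? modes[partnerId] ?? 'shadow-pi' : 'shadow-pi';
  return mode === 'respond-from-temporal' ? 'temporal' : 'pi';
}
```

With a shape like this, rolling a partner back is the same one-line change as flipping it forward.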

One partner cutover went wrong. A retry storm during a regional Azure outage on the partner's side flooded our Worker with duplicate Message-IDs. The Temporal workflow's idempotency key was the AS2 Message-ID, not a hash of body plus Message-ID, and we ended up signing two MDNs for the same logical delivery. The partner did not notice. We did, fixed the activity, and added a regression test that replays the exact pcap.
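For illustration, a key of the kind described, derived from the Message-ID together with the raw body bytes, can be sketched as follows. This assumes a Node.js runtime for `node:crypto`; in a Worker the same digest could be computed with Web Crypto instead. The `idempotencyKey` name and the length-prefix framing are assumptions, not the fix that shipped.

```typescript
import { createHash } from 'node:crypto';

// SHA-256 over the Message-ID and the raw body. The Message-ID is length-prefixed
// so its bytes cannot run together with the start of the body.
function idempotencyKey(messageId: string, body: Uint8Array): string {
  const id = Buffer.from(messageId, 'utf8');
  return createHash('sha256')
    .update(`${id.length}:`)
    .update(id)
    .update(body)
    .digest('hex');
}
```

One option is to use such a key as the Temporal workflow ID, since Temporal will not run two open workflows with the same ID in one namespace; whether that matches the deployed fix is not stated here.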

Week 9: decommissioning and the runbook

Pulling SAP PI 7.4 offline is not a button. We drained the queues, archived the Oracle schema, kept the cluster in a "read-only, no inbound" state for 30 days as an audit fallback, and only then shut it down. The Oracle licence cancellation went through finance separately and the savings landed in the next quarter's books.

The runbook was four pages. One page per common failure mode (certificate expiring, partner down, back-office down, Temporal worker queue backed up), each with the exact CLI commands to run. We pinned it in the ops Slack channel. The Sunday afternoon test was a real one: Marleen ran a simulated partner outage on her own, end to end, while we watched on read-only Zoom.

What we would do again, and what we would skip

The two weeks of traffic capture before any code shipped paid for itself five times over. Skip that, and shadow mode in week five is where the wheels come off. Building the Worker ingress as the dumbest possible thin layer was the second non-negotiable: any logic in the Worker is logic you cannot replay from R2.

What we would skip next time: writing our own diff job. There is a small but real category of EDI-aware reconciliation tools that would have saved a week. We built ours because the schema mappings were specific, but the next client will at least get the off-the-shelf option priced first.

What we underestimated: the political work. Two of the partner contacts had built personal trust with Marleen's predecessor over a decade. They wanted a Zoom call before flipping the ACL, not a ticket. We added "send a human email" as an explicit step in the cutover checklist after the third polite request.

The shape that actually mattered

If you take one thing from this playbook, take the cadence. Two weeks watching, two weeks building, two weeks shadowing, two weeks flipping, one week cleaning. Compress any one of those and the next one absorbs the cost with interest. Stretch them and the team loses the thread. Nine weeks was not a guess; it was the smallest window that gave each phase room to breathe.

When we built the EDI fabric for this logistics client, the thing we kept tripping over was the X.400 timing shim. We solved it by treating the OpenText gateway as an opaque MTA and letting Temporal own all retry semantics on our side of the wire. That decision is the one we now reach for first on any legacy migration where the protocol on the wire is older than the team supporting it.

A reasonable five-minute audit for your own integration estate: open your middleware's outbound logs from yesterday and count how many distinct error classes appeared. If that count is larger than the number of pages in your runbook, you do not have a runbook. You have a wiki.

Key takeaway

Two weeks watching, two weeks building, two weeks shadowing, two weeks flipping, one week cleaning. Compress any one and the next one pays the bill.

FAQ

Why Temporal instead of an event-driven architecture on Kafka?

Long-running retries with full audit history come built in. Kafka would have needed a state machine on top, which is roughly half of what Temporal already gives you for free.

Can Cloudflare Workers actually handle AS2 mutual TLS?

Yes. Cloudflare validates the partner's client certificate at the edge, configured per zone. We resolved partner identity from the certificate subject and rejected anything else before it hit Temporal.
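A minimal sketch of that lookup, assuming the Worker reads the `tlsClientAuth` fields Cloudflare attaches to `request.cf` and maps the subject DN to a partner id through a directory. The `partnerFromCert` helper, the directory, and the example DN are hypothetical.

```typescript
// Subset of the fields Cloudflare exposes on request.cf.tlsClientAuth.
interface TlsClientAuth {
  certPresented: '0' | '1';
  certVerified: string; // 'SUCCESS' when the certificate validated
  certSubjectDN: string;
}

// Returns the partner id for a verified client certificate, or null for anything else.
function partnerFromCert(
  auth: TlsClientAuth | undefined,
  directory: Map<string, string> // subject DN -> partner id (hypothetical)
): string | null {
  if (!auth || auth.certPresented !== '1' || auth.certVerified !== 'SUCCESS') {
    return null;
  }
  return directory.get(auth.certSubjectDN) ?? null;
}

// Example directory with a placeholder DN.
const partnerDirectory = new Map<string, string>([
  ['CN=as2.partner-a.example', 'partner-a'],
]);
```

A null result corresponds to the 401 "unknown partner" path: no verified certificate, or a verified one that is not in the directory.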

How did you avoid downtime during cutover?

Two weeks of shadow processing, then one partner per business day flipped at the Worker config layer. The legacy PI cluster stayed read-only for 30 days as an audit fallback.

What was the hardest week of the project?

Week six. An X.400 timing mismatch through the OpenText shim blocked two partners from passing the reconciliation gate until we let Temporal own all retry semantics on our side of the wire.
