
Magento 2 inheritance: nine months of writes to a dead bucket

A nightly cron had been writing image backups to a deleted S3 bucket since September. The store still ran. The CFO had no idea. This is what we did in week one.

Jacob Molkenboer · Founder · A Brand New Company · 10 Jun 2024 · 10 min


Day one on a Magento 2 takeover always starts the same way. Someone hands you SSH credentials in a Notion doc, the previous lead engineer's resignation date is two weeks behind us, and the staging URL returns a 502. This one was worse than usual.

The client ran a Dutch home-goods business with three storefronts on a single Magento 2 install (one Dutch, one Belgian, one German). Combined revenue around €4 million a year. Stable orders, stable traffic, stable team. The reason we got the call was that their senior developer had quit in March, his replacement had quit in April, and the agency they were paying €8k a month had stopped returning emails.

We were given root, the AWS console, and a vague brief: "just make sure nothing breaks."

Three days later we found out something had already been broken for nine months.

What the codebase actually was

Magento 2 codebases age like milk. The version on disk was 2.4.3-p1. Current at the time was 2.4.7. The vendor folder was 1.8GB. There were 47 third-party modules, of which 12 had been forked into app/code with no record of why.

The first thing we did was run composer show -i and pipe it into a spreadsheet. Then we ran find . -name "*.phtml" -newer composer.lock to list every template modified after composer.lock was last written, a rough proxy for hand edits made since the last dependency update. That returned 312 files.
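The template check can be wrapped in a small function so it is repeatable against other file types. This is a minimal sketch: edited_since is a hypothetical helper name, and it assumes that file modification times on the server are still meaningful (a fresh git checkout resets them).

```shell
#!/bin/bash
# Sketch: list files modified after a reference file such as composer.lock.
# edited_since is a hypothetical helper, not a Magento or Composer command.
edited_since() {
    local root="$1" reference="$2" pattern="${3:-*.phtml}"
    # -newer compares modification times; sort gives a stable, diffable list.
    find "$root" -type f -name "$pattern" -newer "$reference" | sort
}
```

Called as edited_since . composer.lock it reproduces the command above; passing "*.xml" as a third argument would run the same check over layout files.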

Three themes lived in app/design/frontend. One vendor theme that came with the project. One custom theme called brand-2021 that overrode about 40% of templates. One called brand-2023 that overrode about 80% but was only used by the German store. The Belgian store fell back to the parent vendor theme for several pages because the override was missing. No one had noticed, because the page in question was the GDPR consent banner, which the Belgian team had been showing in Dutch for two years.

That is the kind of thing you find on a Magento takeover. Not bugs. Drift.

The cron that wrote to nothing

The bigger discovery came when we asked the obvious question: where are the product images backed up?

The acting CTO, who was a friend of the founder and not a full-time engineer, said: "there's a nightly cron, it pushes to S3."

We checked the cron. It ran at 02:15 every night. Last successful run, according to its own log: yesterday. Lines written: about 14,000. Exit code: 0. Looked fine.

We opened the S3 console and the bucket was not there.

It had been deleted on the 4th of September of the previous year, nine months earlier, by an admin who was no longer with the company. We confirmed this in CloudTrail. The cron had been silently swallowing the failure ever since: the aws s3 sync call sat in a shell script that redirected stderr to /dev/null, and the script's exit status was that of its last command, the echo "done" at the bottom of the file.

#!/bin/bash
# nightly-image-backup.sh
cd /var/www/html/pub/media
aws s3 sync . s3://brand-media-backup-2021/ 2>/dev/null
echo "done" # last line. exit 0 always.

Nine months of zero backups. Around 180GB of accumulated product photography sitting on a single EBS volume with no snapshots. We took a snapshot inside fifteen minutes of finding this, then wrote the email to the founder.
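The mechanism is easy to reproduce without AWS. In this sketch the false command stands in for the failing aws s3 sync call, and the two function names are ours, purely for illustration.

```shell
#!/bin/bash
# Sketch: why the backup script always reported success.
# "false" stands in for the failing aws s3 sync call.

# A script's exit status is that of its last command, so the echo wins.
swallowed() {
    bash -c 'false 2>/dev/null; echo "done"' > /dev/null
    echo $?
}

# With errexit (-e) the shell stops at the first failing command instead.
strict() {
    bash -c 'set -e; false 2>/dev/null; echo "done"' > /dev/null
    echo $?
}

swallowed   # prints 0
strict      # prints 1
```

The redirect to /dev/null only hid the error message. The misleading exit status came from the trailing echo.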

Warning

Any cron that "just works" for months without anyone reading its output is a cron that has stopped working. Exit codes from shell scripts lie. Always check what is actually being produced, not whether the job ran.

Telling the founder

The hardest email to write on a takeover is the one that tells the founder how broken their thing is without making them defensive. We have sent enough of these now to know the shape.

State the fact in one line. State the blast radius in one line. State what you have already done about it in one line. Offer a fifteen-minute call for questions. Do not editorialise about the previous team.

The email we sent that morning was four paragraphs. The founder called us back inside an hour, asked one question ("are we losing money right now?"), and approved the snapshot budget. Then he asked us to map every other cron on the server before we did anything else. That instinct was correct.

The triage we ran in week one

When you inherit a Magento estate that has been on autopilot for a year, you do not start refactoring. You start logging. You assume nothing in the codebase is intentional and nothing in the infrastructure is monitored.

Here is the order we worked in. It is roughly the same on every takeover.

Inventory what exists

Magento, S3, RDS, ElastiCache, the SMTP relay, the search index, the CDN, the staging box, the deploy user's home directory, every cron in /etc/cron.d and crontab -l for every user. Map it. Put it in a single document. Note who owns the credentials and when they were last rotated.

We found two cron users no one knew about. One was running an outdated cleanup script that deleted log files older than 7 days, which was why we had no logs to investigate the S3 cron with.
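The cron part of that inventory can be scripted. What follows is a rough sketch rather than the script we ran: the function names are hypothetical, it assumes local accounts listed in /etc/passwd, and it has to run as root because reading another user's crontab needs elevated privileges.

```shell
#!/bin/bash
# Sketch of a cron inventory. All function names are hypothetical.

# Print every account name from a passwd-format file.
passwd_users() {
    cut -d: -f1 "${1:-/etc/passwd}"
}

# Print the non-comment, non-blank lines of a crontab-style file.
active_lines() {
    grep -Ev '^[[:space:]]*(#|$)' "$1" || true
}

# Collect system-wide and per-user cron entries in one report.
cron_inventory() {
    local f user entries
    for f in /etc/crontab /etc/cron.d/*; do
        if [ -f "$f" ]; then
            echo "== $f"
            active_lines "$f"
        fi
    done
    for user in $(passwd_users); do
        # crontab -l exits non-zero when the user has no crontab; skip those.
        if entries=$(crontab -u "$user" -l 2>/dev/null); then
            echo "== crontab of $user"
            printf '%s\n' "$entries"
        fi
    done
}
```

Redirecting the output of cron_inventory into the inventory document gives a baseline to diff against later. Accounts that come from LDAP or another directory would not appear in /etc/passwd and need a separate pass.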

Find what is silently failing

The dead-bucket cron was the worst of it. There were others. A Klaviyo export job that had been returning 401 since the API key was rotated. A VAT-rate sync from a Belgian tax service that had been timing out and falling back to the last cached value, which was eight months stale. A reindex cron that ran but no longer covered one of the new product attributes added in 2024.

And one we almost missed: a double-opt-in webhook had been pointed at a staging URL for three months because someone set it during a test and never moved it. Real signups were being recorded as bounced. The marketing system showed "invalid email" for around 1,400 customers with perfectly valid addresses. The marketing lead knew the number felt high. No one had connected it to the webhook, because the webhook was not in anyone's mental map of the stack.

You find these by reading every cron and asking: what would tell us if this failed? On most legacy estates the answer is nothing. No one would know.
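One cheap way to turn that question into a check is to look at what a job produced rather than at its exit status. The sketch below assumes the job writes a local file (an export, a log, a dump); is_fresh is a hypothetical helper, and the threshold is whatever window fits the job's schedule.

```shell
#!/bin/bash
# Sketch: judge a job by what it produced, not by whether it ran.
# is_fresh is a hypothetical helper; paths and thresholds are assumptions.

# Succeeds only if the file exists, is non-empty, and was modified
# within the last max_age_minutes minutes.
is_fresh() {
    local file="$1" max_age_minutes="$2"
    [ -s "$file" ] || return 1
    # find prints the path only when the mtime is recent enough.
    [ -n "$(find "$file" -mmin "-$max_age_minutes")" ]
}
```

For a nightly job, is_fresh /path/to/output 1500 allows 25 hours before complaining. For jobs that write to a remote destination, as the image backup did, the same idea applies to the destination, not to the local log.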

Snapshot everything before you touch anything

Volume snapshots, database dump, S3 versioning enabled (or recreated), media folder rsync'd to a separate region. Cost us about €40 a month in extra storage. Worth it the first time something goes wrong.
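As an illustration of the first two items, here is a sketch with a dry-run switch so the commands can be reviewed before anything is executed. The helper names, the volume ID and the bucket name passed in are placeholders; the aws subcommands themselves are standard AWS CLI calls.

```shell
#!/bin/bash
# Sketch of the "snapshot first" step. Helper names and arguments are
# placeholders; nothing here is specific to the client's account.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Take a point-in-time snapshot of one EBS volume.
snapshot_volume() {
    run aws ec2 create-snapshot --volume-id "$1" --description "pre-takeover"
}

# Turn on object versioning for one S3 bucket.
enable_versioning() {
    run aws s3api put-bucket-versioning --bucket "$1" \
        --versioning-configuration Status=Enabled
}
```

Running DRY_RUN=1 snapshot_volume with a volume ID prints the command instead of calling AWS, which is a convenient way to get a second pair of eyes on it during a takeover.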

Patch the security holes you already know about

In 2024 and 2025 Adobe shipped several critical Magento patches, including APSB24-40 and the follow-up bulletins. If the version on disk is more than two minor releases behind, assume it is vulnerable to at least one publicly-disclosed RCE. Patch before you do anything else, even if it means a half-day of regression testing.

This client was on 2.4.3-p1. We brought them to 2.4.7-p3 inside three weeks, in two staged releases. The first release was patches only. The second was the version jump. We split it because mixing security work with feature work makes rollback ambiguous, and on a takeover you want every rollback to be a single decision, not a debate.

Pick one theme

Three themes is two themes too many. We diffed the two custom themes against each other and against the vendor parent. About 60% of the overrides in brand-2023 were copies of the vendor template with no changes, left over from a half-finished migration. We deleted those. Of the actual changes, we kept the 2023 set and ported the three deliberate 2021 changes that were still wanted. Cut the theme count from three to one. Page weight dropped by 240KB on the Belgian store, because the GDPR banner now loaded the right CSS.
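Finding the unchanged copies is mechanical. A minimal sketch, assuming the two trees being compared share the same relative layout (true for two themes; comparing against module templates under vendor needs a path mapping first). identical_copies is a hypothetical name.

```shell
#!/bin/bash
# Sketch: list files in one theme that are byte-identical to the file at the
# same relative path in another tree. identical_copies is hypothetical.
identical_copies() {
    local theme="$1" parent="$2" rel
    # Walk the theme and compare each file with its counterpart, if any.
    (cd "$theme" && find . -type f | sort) | while IFS= read -r rel; do
        rel="${rel#./}"
        if [ -f "$parent/$rel" ] && cmp -s "$theme/$rel" "$parent/$rel"; then
            printf '%s\n' "$rel"
        fi
    done
}
```

Everything it prints is a candidate for deletion, since removing the copy lets the fallback serve the same file. Everything it does not print is a real change that needs a human decision.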

Write down the deploy

The agency had a deploy. It was an undocumented shell script in /home/deploy/release.sh that called bin/magento commands in an order that mostly worked. We rewrote it as a Makefile, committed it to the repo, and removed the script. Now anyone can read what a deploy does in thirty lines.

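# Every target is a command, not a file, so declare them all phony.
.PHONY: deploy pull compile flush cron-restart
# Prerequisites run left to right in a serial make; do not run this with -j.
deploy: pull compile flush cron-restart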

pull:
	git fetch && git reset --hard origin/main
	composer install --no-dev --prefer-dist

compile:
	bin/magento setup:upgrade
	bin/magento setup:di:compile
	bin/magento setup:static-content:deploy nl_NL nl_BE de_DE -f

flush:
	bin/magento cache:flush
	bin/magento indexer:reindex

cron-restart:
	sudo systemctl restart magento-cron

Not glamorous. But a deploy is now a single command, and the team can read it.

The heartbeat pattern we ship every cron with now

After this takeover we adopted a rule: no scheduled job ships without a heartbeat. The pattern is small. Every cron does its work, then makes a single HTTPS POST to a hosted check (we use Healthchecks.io for most clients, an internal endpoint for the ones with stricter data rules). If the check does not receive its ping inside the expected window, it alerts a Slack channel a human reads.

#!/bin/bash
set -euo pipefail

# Do the actual work first. Errors propagate because of -e.
cd /var/www/html/pub/media
aws s3 sync . s3://brand-media-backup/

# Only ping success if every command above succeeded.
curl -fsS --retry 3 -m 10 "https://hc-ping.com/<uuid>" > /dev/null

Two details matter in that script. The first is the placement of the ping: it comes after the meaningful work, never before. We have inherited scripts that ping a heartbeat at the start of the job, on the theory that the cron firing at all is what matters. It is not. What matters is that the job did the thing it was supposed to do. The second is the strict-mode line at the top of the script.

set -euo pipefail is the line that would have caught the nine-month silent failure: with -e the script stops at the failed aws s3 sync and exits non-zero instead of reaching the final echo. Bash does not enable it by default. With it, Bash behaves like a programming language that takes errors seriously. We turn it on in every script we inherit. It breaks a few things on the first run. Those things were already broken; the script was just hiding it.
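The same pattern can be packaged as a wrapper that reports both outcomes, so a failure alerts immediately and does not wait for the window to lapse. This is a sketch under assumptions: HC_URL, hc_ping and with_heartbeat are placeholder names, and it relies on the Healthchecks.io convention that appending /fail to the ping URL signals a failed run.

```shell
#!/bin/bash
# Sketch of a heartbeat wrapper. HC_URL and the function names are
# placeholders; <uuid> is left as in the script above.
HC_URL="${HC_URL:-https://hc-ping.com/<uuid>}"

# Send one ping. A failed ping must not mask the job's own exit status.
hc_ping() {
    curl -fsS --retry 3 -m 10 "$1" > /dev/null || true
}

# Run the given command, then ping success or failure accordingly.
with_heartbeat() {
    local status=0
    "$@" || status=$?
    if [ "$status" -eq 0 ]; then
        hc_ping "$HC_URL"
    else
        hc_ping "$HC_URL/fail"
    fi
    return "$status"
}
```

A crontab entry would then call something like with_heartbeat /usr/local/bin/nightly-image-backup.sh (path hypothetical). The wrapper returns the job's own exit status, so anything else watching the cron still sees the real result.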

Why this happens to Magento specifically

Magento 2 is a fourteen-year-old PHP application with a plugin model that encourages every agency to fork the core. The composer ecosystem helps, but the upgrade path between minor versions still touches third-party modules in ways that surprise people. Most estates we see have been touched by three to five different agencies over their lifetime, each leaving artefacts behind.

The deeper issue is that Magento merchants typically run lean operations teams. There is rarely a full-time engineer in-house. The store works, orders come in, and the codebase quietly accumulates exactly the kind of dead crons, abandoned themes, and silent failures we found here.

Adobe has signalled that the platform is in maintenance mode for new merchants, with Adobe Commerce increasingly marketed to enterprise. That has not slowed the long tail of mid-market merchants who built on Magento 2 between 2018 and 2022 and have nowhere obvious to go. There are tens of thousands of them in Europe alone.

The merchants who keep running Magento 2 past 2027 will be the ones who treat the platform like infrastructure rather than a project. That means a maintenance budget on the books, a documented inventory of every cron and every fork, and a deploy that any new hire can read on their first day. None of that is glamorous. All of it is cheaper than a six-month replatform that nobody costed properly because the previous lead was the only person who knew what would break.

The smallest thing you can do today

When we took over this estate, the worst surprise was a cron whose only failure mode was silence. We ended up wrapping every scheduled job in a heartbeat pattern that posts on success and fails loudly on error. Most of our process automation work on inherited codebases starts with exactly this kind of audit.

Pick one cron job on your production server. Read its actual output for the last week. Not whether it ran. What it produced. If you cannot tell, that is your first ticket.

Key takeaway

A cron that has worked silently for months is a cron that has stopped working. Exit codes from shell scripts lie.

FAQ

How long does a Magento 2 takeover usually take to stabilise?

For a mid-sized estate, three to four weeks to inventory, snapshot, patch, and consolidate. Refactoring is a separate, slower phase that starts once the estate is stable.

Should we upgrade Magento 2 or migrate off it?

Stay on 2.4.x if orders are stable and the team can maintain it. Migrating off Magento is a six-to-twelve-month project. Most merchants are better served by patching and consolidating first.

How do we know if our cron jobs are silently failing?

Read the actual output, not the exit code. Wrap every scheduled job to ping a heartbeat service on success. If there is no signal of success, there is no signal of failure either.

Is it safe to delete an unused Magento theme?

Yes, once you have diffed it against the active theme and confirmed no storefront still falls back to it. Check store-view-level theme config in admin, not just the default scope.

magento · legacy sites · php · operations · migration · e-commerce
