
Magento 2 cron stuck: 14 hours of orders lost silently

A frozen cron table, a warehouse phone going quiet, and 87 transactional emails that never sent. The Magento 2 outage every shop owner should fear, and how to catch it next time.

Jacob Molkenboer · Founder · A Brand New Company · 4 Jun 2026 · 11 min


The call came at 09:14 on a Saturday. The warehouse manager at a mid-sized Dutch homeware brand had been packing the morning's orders when she noticed something off. The pick list was empty. Not "we shipped everything" empty, but "the system hasn't told us about any new orders since yesterday evening" empty. She opened the Magento admin, sorted orders by date, and saw forty-three new orders sitting there from the previous evening and overnight. None of them had been pushed to the WMS. None of the customers had received a confirmation email.

By the time we logged in, the count was up to eighty-seven orders. The webshop had been running normally. Customers were paying. Checkout was working. Stock was decrementing. What had stopped, fourteen hours earlier at around 19:30 on Friday evening, was the cron.

What a stuck cron actually looks like

If you have only worked with Magento 2 from the merchant side, "cron" feels like a vague backend concern. It isn't. The transactional emails that confirm a customer's order. The push to Mollie or Adyen of the captured payment state. The signal to your shipping plugin that an order is paid and ready to fulfill. The product index that makes search results match the catalog. Almost everything Magento does between the moment a customer clicks "Pay" and the moment the warehouse gets a pick list runs through the cron.

The state of those jobs lives in a single MySQL table called cron_schedule. Every job that should run gets a row, with a status field that walks through pending, running, and then either success, missed, or error. A healthy table churns. Old rows clear out, new rows queue up, and the finished_at column trails close behind scheduled_at.

What we saw on that Saturday morning was different. A query like this told the story in three lines:

SELECT job_code, status, COUNT(*) AS n,
       MIN(scheduled_at) AS oldest,
       MAX(scheduled_at) AS newest
FROM cron_schedule
GROUP BY job_code, status
ORDER BY oldest;

Two jobs sat stuck in running since 19:31 Friday. Behind them, more than nine hundred rows were piled up as pending, oldest scheduled for 19:32 Friday, newest for 09:14 Saturday. Nothing had moved for fourteen hours. The most painful row was the one we already suspected. sales_email_order_sender had run zero times since Friday evening. Eighty-seven order confirmations had been queued up, signed, and shelved.
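The same reading of the table can be automated. What follows is a minimal Python sketch, not the tooling we used that morning: it assumes the rows have already been fetched as (job_code, status, scheduled_at) tuples, and the function name, the thirty-minute staleness cutoff, the sample dates, and the choice of which job sits in running are all illustrative assumptions.

```python
from datetime import datetime, timedelta


def summarize_schedule(rows, now, stale_after=timedelta(minutes=30)):
    """Summarize cron_schedule rows given as (job_code, status, scheduled_at)."""
    rows = list(rows)  # allow generators, since we iterate twice
    # Jobs that have sat in 'running' for longer than the cutoff
    stale_running = sorted({
        job_code
        for job_code, status, scheduled_at in rows
        if status == "running" and now - scheduled_at > stale_after
    })
    # Pending rows whose scheduled time has already passed
    overdue = [
        scheduled_at
        for _job_code, status, scheduled_at in rows
        if status == "pending" and scheduled_at <= now
    ]
    return {
        "stale_running_jobs": stale_running,
        "overdue_pending": len(overdue),
        "oldest_overdue": min(overdue) if overdue else None,
    }


# Made-up sample rows shaped like the Saturday morning table (dates are arbitrary)
now = datetime(2026, 1, 3, 9, 14)
rows = [
    ("indexer_reindex_all_invalid", "running", datetime(2026, 1, 2, 19, 31)),
    ("sales_email_order_sender", "pending", datetime(2026, 1, 2, 19, 32)),
    ("sales_email_order_sender", "pending", datetime(2026, 1, 3, 9, 14)),
    ("sales_email_order_sender", "success", datetime(2026, 1, 2, 19, 25)),
]
print(summarize_schedule(rows, now))
```

A non-empty stale_running_jobs list next to a growing overdue_pending count is the signature described above: something holds the queue, and everything behind it waits.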

The root cause was boring

It usually is. The shop ran Magento 2.4.4 on a single 8 GB VPS. PHP-FPM had crashed one of its workers around 19:30 during a heavy indexer rebuild, when memory pressure spiked against an open product import. The cron process that held the file lock at var/locks/cron_group_default.lock died with the worker, and the lock file itself was never released. Subsequent ticks from the system cron daemon saw the lock, decided another instance was already running, and exited silently. The cron_schedule rows that should have transitioned out of running stayed there. The pending queue grew, hour by hour, against a scheduler that had quietly stopped listening.

The host had been swap-bound for forty minutes before the worker crashed. vmstat 5 history showed the si and so columns sustained at 4-6 MB per second, which on this VPS meant the kernel was paging more memory than the disk could absorb. PHP-FPM's pm.max_children was set to 25 against a 256 MB per-worker memory limit. The arithmetic, 25 multiplied by 256, lands at 6.4 GB. That left no headroom for MySQL, Redis, and the kernel page cache once the product import opened a large file handle on top. The OOM killer reached the PHP-FPM pool first, and the cron worker holding the lock went with it. Magento was a victim of the host, not the author of its own failure.
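The memory arithmetic is easy to check before it bites. Here is a rough Python sketch under stated assumptions: 8 GB is counted as 8192 MB, the 3 GB reserve for MySQL, Redis, and the operating system is a hypothetical figure rather than a measurement from this shop, and the per-worker memory limit is treated as a worst-case ceiling that every worker hits at once.

```python
def fpm_worst_case_mb(max_children, per_worker_limit_mb):
    """Upper bound on PHP-FPM memory if every worker reaches its limit."""
    return max_children * per_worker_limit_mb


def safe_max_children(total_ram_mb, reserved_mb, per_worker_limit_mb):
    """Largest pm.max_children that still leaves reserved_mb untouched."""
    return max((total_ram_mb - reserved_mb) // per_worker_limit_mb, 1)


TOTAL_RAM_MB = 8 * 1024  # assumption: the 8 GB VPS counted as 8192 MB
RESERVED_MB = 3 * 1024   # hypothetical reserve for MySQL, Redis, kernel, cron

worst_case = fpm_worst_case_mb(25, 256)
print(worst_case)                                        # 6400
print(TOTAL_RAM_MB - worst_case)                         # 1792
print(safe_max_children(TOTAL_RAM_MB, RESERVED_MB, 256)) # 20
```

With the shop's settings, fewer than 2 GB remain for everything that is not PHP-FPM. Under the assumed reserve, a pm.max_children of 20 would have been the ceiling, not 25.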

From outside, the shop looked fine. Checkout responded. The frontend cached pages from Varnish. Only the asynchronous half of Magento, the half nobody sees from the customer side, had stopped.

Warning

If your Magento 2 monitoring only watches HTTP status codes, you cannot see this failure mode. The site will be 200 OK while orders go unshipped. You need to watch the database table, not the web server.

The fix that morning

The recovery itself took twelve minutes once we understood the shape of the problem. Adobe's own documentation on the Magento cron architecture describes the lock file mechanism and the manual recovery path. In practice we did three things, in order.

First, we killed the dead lock and any stale PHP processes:

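# List any cron:run processes left over from before the crash
ps aux | grep "cron:run" | grep -v grep
# Kill each stale process, substituting the PID reported above
sudo kill -9 <pid>
# Remove the orphaned lock so the next tick can acquire it
sudo rm -f /var/www/magento/var/locks/cron_group_default.lock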

Second, we reset the running rows so the scheduler would stop treating those jobs as in flight and let their pending runs start:

UPDATE cron_schedule
SET status = 'missed',
    messages = CONCAT(IFNULL(messages, ''), ' [reset by ops 2026-06-04]')
WHERE status = 'running'
  AND scheduled_at < NOW() - INTERVAL 30 MINUTE;

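Third, we triggered the email sender by hand so the eighty-seven customers waiting on confirmations did not have to wait another cron cycle. Consumer names differ between installs, so check php bin/magento queue:consumers:list before copying the second command: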

php bin/magento cron:run --group=default
php bin/magento queue:consumers:start sales.email.order

Within five minutes the confirmation emails started flowing. Within an hour we had cleared the backlog, contacted the customers whose orders had drifted past their promised dispatch window, and offered a small credit to the four who had already complained. Nobody churned. The shop owner did not lose a single customer over it. What she lost, and what nobody can buy back, was fourteen hours of trust in her own monitoring.

Splitting the cron groups

The recovery brought the shop back online. The configuration change the next morning made the same failure mode harder to repeat. Out of the box, Magento 2 ships three cron groups: default, index, and consumers. On a single-VPS install all three can run inside the same parent process under the same lock. That is exactly the topology that bit us. A heavy reindex starved the transactional jobs that sent confirmation emails and pushed orders to the WMS, because they were waiting on the same lock the reindex was holding.

The fix has two parts. First, give the indexer group its own crontab entry, run by its own PHP process, with its own lock file:

# /etc/cron.d/magento-default
# cron has no backslash line continuation, so each entry stays on one line
* * * * * www-data /usr/bin/php /var/www/magento/bin/magento cron:run --group=default >> /var/log/magento/cron-default.log 2>&1

# /etc/cron.d/magento-index
* * * * * www-data /usr/bin/php /var/www/magento/bin/magento cron:run --group=index >> /var/log/magento/cron-index.log 2>&1

Second, shorten the schedule history for the index group so its rows do not crowd the cron_schedule table. The default success lifetime is sixty minutes, which is reasonable for a once-an-hour transactional job but generates a lot of churn for an indexer that runs every minute. We dropped ours to ten, with a longer failure lifetime so error rows stay visible for the morning standup:

<!-- etc/cron_groups.xml in a custom module (partial) -->
<!-- These are group-level settings; the same values can be set in the admin under Stores > Configuration > Advanced > System > Cron -->
<config>
  <group id="index">
    <history_cleanup_every>10</history_cleanup_every>
    <history_success_lifetime>10</history_success_lifetime>
    <history_failure_lifetime>60</history_failure_lifetime>
  </group>
</config>

After splitting the groups and tuning the lifetimes, the cron_schedule table dropped from roughly twelve thousand rows in steady state to under four hundred. The two crontabs hold independent lock files. A stuck index group can no longer freeze the default group. The transactional half of the shop now has its own narrow lane, and the indexer has its own.

Three alerts we wired in that afternoon

The interesting work started after the dust settled. Magento has been around long enough that "watch the cron" is well-trodden ground, but most of the public guidance is generic. We wanted three specific alerts, each tuned to a different failure mode of the same underlying problem. Each one had to fire well inside the first hour of a fault, and each one had to be cheap enough that a small in-house team could keep them green without a dedicated SRE.

Alert 1: cron_schedule heartbeat

This is the obvious one. Once every five minutes a small script queries the schedule table and checks how long it has been since the most recent job moved to success. If that gap exceeds fifteen minutes, the script posts to a Slack webhook. The query is shorter than the alert description:

SELECT TIMESTAMPDIFF(MINUTE, MAX(finished_at), NOW()) AS quiet_minutes
FROM cron_schedule
WHERE status = 'success';

The threshold of fifteen minutes is not arbitrary. Magento's default cron group ticks once a minute, and the slowest "normal" job we measured on this shop ran in around four minutes. Fifteen minutes gives us a safety margin without firing on a slow reindex. The script runs as its own system cron, separate from Magento's own crontab, so a frozen Magento cron cannot also freeze the watchdog.
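The decision logic of such a watchdog fits in a few lines. The sketch below is a Python illustration and not the script we deployed: the function names are hypothetical, the quiet_minutes value is assumed to come from the query above through whatever MySQL client the host already has, and the webhook URL is supplied by the caller. One detail is easy to miss: on a table with no success rows at all, the query returns NULL, which should count as an alert and not as silence.

```python
import json
import urllib.request

QUIET_THRESHOLD_MINUTES = 15


def heartbeat_alert(quiet_minutes, threshold=QUIET_THRESHOLD_MINUTES):
    """Return an alert message, or None when the scheduler looks healthy."""
    if quiet_minutes is None:
        # MAX(finished_at) was NULL: no successful run is on record at all
        return "Magento cron: no successful job found in cron_schedule"
    if quiet_minutes > threshold:
        return (
            f"Magento cron quiet for {quiet_minutes} minutes "
            f"(threshold {threshold})"
        )
    return None


def post_to_slack(webhook_url, text):
    """Send a plain-text message to a Slack incoming webhook."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Typical wiring, with quiet_minutes read from the heartbeat query:
#   message = heartbeat_alert(quiet_minutes)
#   if message:
#       post_to_slack(webhook_url, message)
```

Keeping the threshold check separate from the network call makes the part that matters testable without a database or a Slack workspace.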

Alert 2: orders without a confirmation email

The heartbeat catches the technical fault. This second alert catches the consequence the customer actually feels. Every ten minutes the script asks a simple question of the sales_order table: are there any paid orders between thirty minutes and six hours old that do not yet have email_sent = 1?

SELECT entity_id, increment_id, created_at
FROM sales_order
WHERE state IN ('processing', 'complete')
  AND COALESCE(email_sent, 0) = 0
  AND created_at < NOW() - INTERVAL 30 MINUTE
  AND created_at > NOW() - INTERVAL 6 HOUR;

If that query returns more than three rows, we treat it as an incident. The thirty-minute grace window allows for legitimately slow email runs during a flash sale. The six-hour ceiling stops the alert from screaming about a historical mishap forever. This one is the alert that would have woken us up on Friday night, hours before the warehouse manager noticed on Saturday morning.
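To make the two windows and the row threshold concrete, the same filter can be written in Python, which is handy for unit-testing the thresholds without a database. This is a sketch under assumptions: orders are plain dicts with state, email_sent, and created_at keys, the function names are hypothetical, and any email_sent value other than 1 is treated as unsent.

```python
from datetime import datetime, timedelta

PAID_STATES = {"processing", "complete"}


def unconfirmed_orders(orders, now,
                       grace=timedelta(minutes=30),
                       ceiling=timedelta(hours=6)):
    """Paid orders older than the grace window, newer than the ceiling,
    that have no confirmation email recorded."""
    oldest_allowed = now - ceiling
    newest_allowed = now - grace
    return [
        order for order in orders
        if order["state"] in PAID_STATES
        and order["email_sent"] != 1  # assumption: NULL and 0 both mean unsent
        and oldest_allowed < order["created_at"] < newest_allowed
    ]


def is_incident(matches, max_tolerated=3):
    """More than three unconfirmed orders is treated as an incident."""
    return len(matches) > max_tolerated
```

An order placed ten minutes ago is ignored, an order placed seven hours ago is ignored, and everything in between counts toward the threshold.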

Alert 3: lock file age

The third alert is the cheap belt-and-braces. The var/locks/ directory should never have a lock file older than the longest legitimate job. We picked thirty minutes as the cutoff. A one-line shell script, run from the system crontab, does the work:

find /var/www/magento/var/locks -name '*.lock' -mmin +30 \
  -exec echo "STALE_LOCK: {}" \;

Any non-empty output triggers a Slack post. We piped the output through a tiny Python helper that adds the hostname and the lock file's age in minutes, so the on-call engineer sees the full context in one line rather than a bare filename.
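A helper along those lines could look like the following. It is a sketch and not the exact script we run: the function names and the output format are illustrative, and it assumes each input line has the STALE_LOCK: prefix produced by the find command above.

```python
import os
import socket
import time

PREFIX = "STALE_LOCK: "


def parse_path(line):
    """Extract the lock path from one line of find output, or None."""
    line = line.rstrip("\n")
    return line[len(PREFIX):] if line.startswith(PREFIX) else None


def describe_stale_lock(path, mtime, now, hostname):
    """Format one alert line with host, file, and age in whole minutes."""
    age_minutes = int((now - mtime) // 60)
    return f"STALE_LOCK host={hostname} file={path} age={age_minutes}m"


def annotate(lines, now=None, hostname=None, getmtime=os.path.getmtime):
    """Turn find output lines into alert lines with context attached."""
    now = time.time() if now is None else now
    hostname = hostname or socket.gethostname()
    messages = []
    for line in lines:
        path = parse_path(line)
        if path is None:
            continue  # ignore anything that is not find output
        try:
            mtime = getmtime(path)
        except OSError:
            continue  # the lock vanished between find and this helper
        messages.append(describe_stale_lock(path, mtime, now, hostname))
    return messages


# Typical use at the end of the pipe:
#   import sys
#   for message in annotate(sys.stdin):
#       print(message)
```

Passing getmtime as a parameter is only there so the formatting can be tested without touching the filesystem.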

Why three, not one

The temptation when you build incident monitoring is to centralise it into a single dashboard with a single rule. We learned the hard way that one alert means one failure mode. Two alerts mean you catch the failure even when the alert itself fails. Three alerts mean you can sleep on weekends.

Each of the three above watches a different layer. The heartbeat watches the scheduler. The email check watches the customer-visible outcome. The lock file check watches the operating system. Any one of them, on its own, would have caught the Friday outage within the first hour. Together, they make the failure mode we lived through that morning effectively impossible to repeat unmonitored.

There is a second reason for the trio that we only really noticed three months later, in an unrelated incident on a different shop. A single alert tells you that something is broken. Three alerts firing in sequence tell you which layer broke first. On a frozen scheduler, the heartbeat fires first, roughly fifteen to twenty minutes after the last successful job. The lock-file watcher follows once the orphaned lock is more than thirty minutes old. The email-sent check fires last, once more than three paid orders have aged past the thirty-minute grace window. A heartbeat page followed by a stale-lock page tells the on-call engineer to look in the lock directory before they SSH into MySQL. That diagnostic shortcut is worth the small extra effort of running three checks instead of one.

The remaining advice is the boring kind that always shows up at the end of these stories. Keep queue_message_status small by tuning its lifetime. Watch memory pressure on the PHP-FPM pool, because a hung worker is far more often the upstream cause than a Magento bug. Adobe's Experience League guide on managing cron jobs has the canonical commands for splitting groups and running consumers separately. And test the alerts by deliberately killing the cron in a staging environment, because the worst time to discover that your Slack webhook expired six months ago is the moment a real incident starts.

The five-minute audit

When we wired this monitoring layer for the homeware client's Magento 2 stack, the thing we kept coming back to was that none of the three alerts required new infrastructure. The queries above run on the existing MySQL instance. The watchdog scripts run as plain system cron jobs. Slack is free.

If you run a Magento 2 shop and you want to know in the next five minutes whether you carry the same hidden risk, open a SQL client against your production database, paste the heartbeat query from Alert 1, and look at how long it has been since the most recent finished_at. If the answer is more than fifteen minutes, you already have your first alert to write. We have built variations of this process automation for half a dozen Magento and WooCommerce shops, and the pattern is always cheaper, smaller, and duller than the outage that triggered it.

Key takeaway

If your Magento 2 monitoring only watches HTTP status codes, you cannot see a stuck cron. The site stays 200 OK while orders go unshipped.

FAQ

How often does Magento 2 cron actually get stuck?

More often than vendors admit. In our experience, a shop running heavy indexer work on a single VPS should expect at least one stuck cron event per year. The fix is monitoring, not Magento upgrades.

Will upgrading to Magento 2.4.7 fix this?

It reduces the frequency. The split message queue and consumer process model in newer releases make some failure modes recoverable. It does not remove the need for the three alerts described here.

Can I rely on Magento's own cron monitoring panel?

Not for this. An admin-side cron view, whether it comes from core, an extension, or your host's tooling, shows you what happened. It will not tell you what stopped happening, which is the failure mode that costs you orders. You need an external watchdog.

What is the smallest version of this I can run today?

One cron entry on your VPS that queries cron_schedule every five minutes and posts to a Slack webhook if the most recent finished_at is more than fifteen minutes old. Twelve lines of bash plus a SQL query.

