Drupal 11 upgrade rollback: the orphaned UUID war story
Nine days into a Drupal 9 to 11 upgrade, an Antwerp software vendor found 18,400 support cases pointing at UUIDs that no longer resolved. Here is what went wrong.

It was a Tuesday in May when the lead developer at the Antwerp manufacturing-software vendor opened a Slack thread titled "we are going to miss the deadline." Their Drupal 9 site had been frozen mid-upgrade for nine days. The customer portal, which ran 18,400 active support cases for clients across Belgium and the Netherlands, was technically still online. But every link from a customer node to an open case now pointed at a UUID that no longer resolved to anything. The Drupal 11 target environment was running. The Drupal 9 source was untouched. And somewhere in between, the migration had quietly rolled itself back and left the artefacts behind.
That paragraph is the entire post in miniature. The rest is how it got there, why it was silent, and what we run now to stop it happening again.
The setup that broke
The vendor had been on Drupal 9 since 2021. They had skipped Drupal 10 because the team wanted to do one upgrade, not two. That instinct is reasonable, but it cuts against the supported route: Drupal core ships a clean path from 9 to 10 and from 10 to 11, with no direct jump from 9 to 11, and the official Drupal upgrade documentation lays out a contrib-module audit you are supposed to run before you start. What gets people in trouble is custom fields, custom entity types, and the layers of contrib modules that quietly bind to deprecated APIs.
Their customer-portal node type had an entity-reference field, field_assigned_engineer, pointing at user accounts. It had been added in 2019 on Drupal 8, ported to 9, and never touched since. Somewhere along that journey the field's storage configuration had drifted. The schema in the database said one thing, the field config YAML said another, and the entity definition cache had been holding the union of both for years. Drupal 9 did not care. Drupal 10's stricter entity definition update manager did.
What "silent rollback" actually means
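Most Drupal upgrade documentation describes the Migrate API as transactional. It is, mostly. When a migration row fails to import, it is supposed to be marked as failed (MigrateIdMapInterface::STATUS_FAILED, value 3) in the source_row_status column of the migrate_map_* tracking table, and the row is supposed to be either skipped or retried on the next run. Rows that a plugin deliberately skips are recorded as ignored (STATUS_IGNORED, value 2), which is the value the queries in this post look for.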
What is less well-documented is what happens when the failure is not in a row but in the migration's prepareRow setup. If a source plugin throws after some rows have already been written to the destination, Drupal does not always reverse those writes. It marks the migration as incomplete, logs a single line to dblog, and stops. The destination keeps whatever it had at the moment of the throw. The source plugin marks itself for retry. The migrate map ends up with a partial view of reality.
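One way to see that partial view is to break the map table down by status. The sketch below assumes the status codes defined on core's MigrateIdMapInterface (0 imported, 1 needs update, 2 ignored, 3 failed) and a tab-separated status/count listing on stdin. The function name is ours, not part of Drush, and the exact output format of drush sql:query on your database driver is an assumption.

```shell
#!/usr/bin/env bash
# Label per-status row counts from a migrate map table.
# Input (stdin): tab-separated "source_row_status<TAB>count" lines, for example from
#   drush sql:query "SELECT source_row_status, COUNT(*) FROM migrate_map_support_case GROUP BY source_row_status"
# The numeric codes follow MigrateIdMapInterface in Drupal core.
label_map_statuses() {
  awk -F'\t' '
    BEGIN {
      name[0] = "imported"; name[1] = "needs_update"
      name[2] = "ignored";  name[3] = "failed"
    }
    NF >= 2 {
      # Fall back to a visible marker for any code we do not recognise.
      label = ($1 in name) ? name[$1] : "unknown(" $1 ")"
      printf "%s\t%s\n", label, $2
    }
  '
}
```

A migration whose breakdown shows anything other than imported rows has not finished, whatever the cron log says.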
In the Antwerp case, that meant about 11,000 of the 18,400 support cases had been migrated as new nodes with fresh UUIDs in Drupal 11. The remaining 7,400 or so had not. The field_assigned_engineer field on every migrated case referenced engineer accounts by their old Drupal 9 UUIDs, which had not been migrated yet because the user migration was queued to run after the support case migration finished. It never did.
The team's monitoring saw "migration in progress" for nine days because the cron job was technically still scheduled. Nobody got paged. The Drupal 11 site was passing health checks because the front-end node listings rendered fine. It was the deep-link from a customer dashboard to "your open cases" that returned empty arrays. Customers noticed before the team did.
A Drupal migration that has not thrown a fatal error is not the same as a Drupal migration that has finished. Check drush migrate:status, not the cron log.
The deprecated property that started it
The trigger was a single deprecated property on the entity-reference field's storage settings. In Drupal 9, the target_type could be left implicit if the field was attached at install time to a single node type. Drupal 10 made it explicit, and the change record on Drupal.org flagged this as a hard requirement, not a soft warning. In practice, though, anyone who ran drush updb on an unprepared field got only a warning, not an error. The update database routine completed. The field continued to work in the UI. The Migrate API, however, started reading the field's target_type from the new explicit location, which was empty.
When the support case migration tried to resolve the engineer reference, it found a target_type of null, decided it could not safely write the reference, and threw a MigrateSkipRowException. After the third such skip, the migration's batch threshold was hit and the whole batch quietly rolled back to the last savepoint. The next batch never started, because the batch runner thought it was still inside the prior one.
To reproduce locally, the team ran this against a clone of production:
drush --uri=https://portal.local migrate:status --group=customer_portal
drush --uri=https://portal.local migrate:messages support_case --limit=5
drush --uri=https://portal.local sql:query \
"SELECT COUNT(*) FROM migrate_map_support_case WHERE source_row_status = 2"
The third command returned 7,412. That number matched the gap between the cases customers could see in the old portal and the cases the new system had loaded.
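Checking that the numbers line up is simple arithmetic, but it is worth scripting so nobody eyeballs it under pressure. A minimal sketch, assuming you already have the source total and the imported count (for example from the total and imported columns of migrate:status); the function name and its arguments are hypothetical.

```shell
#!/usr/bin/env bash
# Compare the gap between source rows and imported rows with the number of
# rows the map table flags as not imported.
# Usage: explain_gap <source_total> <imported> <flagged_in_map>
explain_gap() {
  local total=$1 imported=$2 flagged=$3
  local gap=$(( total - imported ))
  if [ "$gap" -eq "$flagged" ]; then
    echo "gap=$gap explained by map rows"
  else
    # A mismatch means some rows are neither imported nor flagged in the map.
    echo "gap=$gap but map flags $flagged rows; $(( gap - flagged )) unaccounted for"
    return 1
  fi
}
```

If the gap and the flagged count disagree, some rows were never attempted at all, which is a different problem from rows that were attempted and skipped.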
The recovery, in the order we ran it
We were called in on day eight. The first decision was whether to roll forward or roll back. Rolling back meant restoring the Drupal 9 database from before the upgrade started, which was easy, except that customers had filed 312 new cases against the broken Drupal 11 site in the meantime. Losing them was not an option.
Rolling forward meant repairing the entity-reference field's storage configuration in place, re-running the failed migration with --update, and reconciling the 312 net-new cases by hand. That is what we did. The steps:
# 1. Put the site in maintenance mode so only privileged users (the engineers) can reach it
drush state:set system.maintenance_mode 1
# 2. Repair the field storage config to match the new explicit shape
drush config:get field.storage.node.field_assigned_engineer > /tmp/before.yml
drush php:eval "
\$config = \Drupal::configFactory()->getEditable('field.storage.node.field_assigned_engineer');
\$config->set('settings.target_type', 'user')->save();
\Drupal::entityDefinitionUpdateManager()->updateFieldStorageDefinition(
\Drupal::service('entity_field.manager')->getFieldStorageDefinitions('node')['field_assigned_engineer']
);
"
# 3. Clear the migrate map for the skipped rows (status 2) only, NOT the imported rows
drush sql:query "DELETE FROM migrate_map_support_case WHERE source_row_status = 2"
# 4. Re-run the user migration first, then the support cases against it
drush migrate:import user --group=customer_portal
drush migrate:import support_case --update --group=customer_portal
# 5. Reconcile the 312 cases filed during the outage
drush scr scripts/reconcile-orphan-cases.php
The reconciliation script walked the new cases, matched them against the original Drupal 9 customer UUIDs using a lookup table the team had been smart enough to keep, and rewrote the references. It ran in 47 seconds. The site came back up four hours later, once the team had verified that random spot-check cases resolved end-to-end.
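We cannot reproduce the client's script here, but the core of it is a join against the lookup table. The sketch below is an illustration under assumptions, not the script that ran: it assumes two hypothetical tab-separated files, one mapping old UUID to new UUID and one listing each case ID with the old UUID it references. It prints the rewritten pairs and reports unmatched references on stderr instead of writing anything to Drupal.

```shell
#!/usr/bin/env bash
# Rewrite stale references using an old-UUID -> new-UUID lookup table.
# Both file formats are assumptions made for this sketch:
#   lookup file: old_uuid<TAB>new_uuid
#   cases file:  case_id<TAB>old_uuid
# Prints case_id<TAB>new_uuid on stdout; unmatched cases go to stderr.
remap_references() {
  local lookup=$1 cases=$2
  # An empty lookup would make awk treat the cases file as the lookup.
  [ -s "$lookup" ] || { echo "lookup table is empty: $lookup" >&2; return 2; }
  awk -F'\t' -v OFS='\t' '
    NR == FNR { new_for[$1] = $2; next }             # first file: load the lookup
    ($2 in new_for) { print $1, new_for[$2]; next }  # second file: matched reference
    { print "UNMATCHED", $1, $2 > "/dev/stderr"; bad = 1 }
    END { exit bad }
  ' "$lookup" "$cases"
}
```

Returning a non-zero status when any reference is unmatched makes it hard to declare the reconciliation done while orphans remain.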
What two hours of pre-flight would have caught
This was preventable. Not in hindsight, but in foresight. The mistakes that lined up to make it happen are the same ones we see on every legacy Drupal upgrade. So here is what we now run before anyone touches drush updb on a production database.
Audit every field storage for implicit-to-explicit drift
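Run Upgrade Status and read every warning, not just the errors. Then for every entity-reference, file, image, and taxonomy-term field, dump the storage YAML (field.storage.*) and confirm that target_type is populated, and dump the matching field config (field.field.*) and confirm that handler and handler_settings, including target_bundles, are populated. If any of them are empty or absent, fix them on the source before the upgrade, not after.
# Flag reference-type field storages on nodes whose target_type is empty, null, or absent
for field in $(drush php:eval 'echo implode(PHP_EOL, \Drupal::configFactory()->listAll("field.storage.node."));'); do
  yaml=$(drush config:get "$field")
  echo "$yaml" | grep -Eq '^type: (entity_reference|file|image)$' || continue
  echo "$yaml" | grep -E "^[[:space:]]*target_type: *['\"]?[A-Za-z_]" | grep -qv 'null' \
    || echo "MISSING: $field"
done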
Run the migration in dry-run mode against a production clone
Drupal does not have a native dry-run for the Migrate API. You can fake it well enough with a destination plugin override that writes to a shadow table. Spend a day building that harness. It will save you nine days later.
Watch the migrate map, not the cron log
Add a check to your monitoring that runs every fifteen minutes during an upgrade window, reads drush migrate:status and the migrate_map_* tables, and pages you if any migration has had rows sitting at source_row_status = 2 for more than ten minutes. The Drupal 11 dblog will not save you. Your monitor will.
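The paging rule itself is small. A minimal sketch, assuming your monitor records when it first saw stuck rows for a migration, since the map table's last_imported column may not be populated for skipped rows. The function name, its arguments, and the 600-second default are our choices for illustration.

```shell
#!/usr/bin/env bash
# Decide whether to page based on stuck rows in a migrate map table.
# Usage: should_page <stuck_row_count> <first_seen_epoch> <now_epoch> [max_age_seconds]
# stuck_row_count could come from a source_row_status = 2 count query;
# first_seen_epoch is when the monitor first observed a non-zero count.
should_page() {
  local stuck=$1 first_seen=$2 now=$3 max_age=${4:-600}
  if [ "$stuck" -gt 0 ] && [ $(( now - first_seen )) -gt "$max_age" ]; then
    echo "PAGE: $stuck rows stuck for $(( now - first_seen ))s"
    return 0
  fi
  # Either nothing is stuck or it has not been stuck long enough yet.
  return 1
}
```

Called from the monitor as something like should_page "$count" "$first_seen" "$(date +%s)", it stays quiet for transient skips and pages only when rows have been stuck past the threshold.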
The dangerous Drupal upgrade failures are the silent ones. Treat any migration that has not finished as a migration that has actively failed.
Why this keeps happening across the contrib stack
The Antwerp team's field was custom, but the same pattern bit two contrib modules they relied on. Group had a similar implicit-target-type field in an older 2.x branch. Webform had a more subtle version of the same drift in its handler settings. Both had been patched upstream, but neither patch had been pulled into the team's composer lockfile because they were pinned to a version that predated the fix.
The contrib ecosystem is one of Drupal's strengths. It is also the thing that breaks first on a multi-major upgrade. Spend an afternoon reading the change logs of every contrib module in your composer.json, going back to the version you are running, and you will find at least one of these silent drifts on most sites.
The order of operations we would have used on day one
If we had been in the room before the upgrade started, the only thing we would have changed was the order of operations. Migrate the source content fields to the explicit configuration shape on Drupal 9 itself, before introducing any version change. The Drupal 9 site does not need the explicit shape. But Drupal 10 will, and shipping that one change while the rest of the system is unchanged isolates the failure mode. If something breaks, you know it is the field, not the upgrade.
The team has since adopted this. Every storage drift gets fixed on the current major before they consider moving. The next upgrade is scheduled for September. We will be there for the pre-flight.
When we ran the legacy migration for the Antwerp vendor's Drupal 11 cutover, the thing that bit us was not the Drupal 11 changes themselves but seven years of drift between what the database thought the field was and what the exported config said. We ended up writing a one-page checklist that runs against the source site before anyone touches the target, and that checklist now runs against every Drupal client on our maintenance shelf.
If you are sitting on a Drupal 9 site today, the smallest useful thing you can do this afternoon is dump the field storage YAML for every entity-reference field on your content types and grep for empty target_type values. Five minutes of work. The output is either reassuring, or, for about one in three sites we audit, the start of a conversation.
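If the site's configuration is exported to disk, the same check works without touching a running site. A rough sketch, assuming a config sync directory of field.storage.*.yml files in the shape Drupal exports (a top-level type key, with target_type nested under settings); the function name is ours.

```shell
#!/usr/bin/env bash
# List exported entity-reference field storages with no usable target_type.
# Usage: find_empty_target_types <config_sync_dir>
find_empty_target_types() {
  local dir=$1 file
  for file in "$dir"/field.storage.*.yml; do
    [ -e "$file" ] || continue
    # Only entity-reference storages are expected to carry a target_type.
    grep -q '^type: entity_reference$' "$file" || continue
    # Flag the file if target_type has no value, or if the value is null.
    if ! grep -Eq "^[[:space:]]+target_type:[[:space:]]*['\"]?[A-Za-z_]" "$file" \
       || grep -Eq "^[[:space:]]+target_type:[[:space:]]*(null|~)[[:space:]]*$" "$file"; then
      echo "MISSING: $(basename "$file" .yml)"
    fi
  done
}
```

Because it reads the exported files, it catches drift in what the config says; pairing it with the drush-based loop against active config covers the case where the two disagree.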
Key takeaway
A Drupal migration that has not thrown a fatal error is not the same as a finished one. Watch the migrate map, not the cron log.
FAQ
What caused the silent Drupal migration rollback in this case?
An entity-reference field with an implicit target_type was left in its Drupal 9 shape. Drupal 10's Migrate API read the explicit target_type as null and threw a MigrateSkipRowException, which tripped the batch rollback threshold.
How do I detect a stalled Drupal migration before customers notice?
Monitor source_row_status in the migrate_map_* tables and alert on rows stuck at status 2 for longer than ten minutes. Cron logs and dblog will not surface this on their own.
Can I safely skip Drupal 10 and upgrade from 9 directly to 11?
No. Drupal 11 has no direct upgrade from 9. You must go 9 to 10 first, then 10 to 11, and the deprecation surface from each step needs to be cleared before moving on.
Why does an empty target_type break entity-reference fields?
Drupal 10 onwards expects target_type set explicitly in field storage. When it is empty, reference resolution returns null instead of an entity, and Migrate refuses to write the reference rather than corrupt it.
Is rolling forward always safer than restoring the old database?
No. It depends on whether new writes have hit the broken target environment. If customers have created records you cannot afford to lose, roll forward. If nothing has been written, the clean restore is usually faster and lower risk.