Drupal

Drupal 10 migration: rebuilding 14,000 lost taxonomy refs

The migrate batch finished green at 22:14. QA found a page with no tags at 23:40, then 14,000 more by 02:00. The site was due live at 09:00.

Jacob Molkenboer· Founder · A Brand New Company· 5 Jun 2026· 11 min

Open leather ledger on ivory paper, brass key on spine, torn card with green ribbon, cracked red wax seal, side light.

The migrate batch finished at 22:14. We watched the last of the 31,000 node rows tick into the new Drupal 10 site, ran a quick smoke check on three randomly picked content pages, called it done, and went to bed. At 23:40 the client's QA lead opened a content page and noticed the tags strip was empty. By 00:20 we had a sample of 40 pages with no tags at all. By 02:00 we knew the number was closer to 14,000. The site was supposed to be live at 09:00 for the client's launch announcement.

This is the story of a quiet failure in Drupal's Migrate API, the diagnosis we ran at 03:00, and the SQL we used at 04:30 to put the references back where they belonged.

The setup

The job was a Drupal 7 to Drupal 10 migration for a mid-sized publisher. About 31,000 nodes across six content types, three taxonomy vocabularies with about 2,200 terms, and a long tail of entity_reference fields linking nodes to terms. The migration pipeline followed the standard pattern: terms first, then media, then nodes. The full plan ran inside drush migrate:import --tag=publisher_d10.

We had rehearsed this twice on staging without obvious failure. The dry runs both showed Processed: 31,142, Created: 31,142, Updated: 0, Failed: 0. The same line printed on production. No errors went to the log.

How the failure happened

The dropped references all came from one field, field_topic, an entity_reference field to the topics vocabulary. It was the only field on the migration that used a custom migration_lookup chain because the source data stored taxonomy as a delimited string in a Drupal 7 text field instead of a proper taxonomy_term_reference. The 2018-era developer who built the original site had bolted it on that way to avoid a node-edit form they did not like.

The process pipeline read roughly like this:

field_topic:
  - plugin: explode
    delimiter: '|'
    source: field_topic_raw
  - plugin: callback
    callable: trim
  - plugin: migration_lookup
    migration: publisher_topics
    no_stub: true
  - plugin: skip_on_empty
    method: process

That last line is the bug. skip_on_empty with method: process does not skip the whole row, it skips the field. When migration_lookup returned NULL for a term that had not made it through the topics migration (because 187 source terms had failed on curly quotes in the name and the dependent rows were never created), the field was set to empty for that delta. Across 31,000 nodes with a long tail of niche topics, that compounded into 14,000-odd missing references.

The lesson buried in there: migrate:import reports row-level success, not field-level success. A row can be marked Created with every field on it empty and the Migrate API counts it as a win.

Why the staging runs looked clean

We rehearsed the migration twice on staging and both runs reported zero failures. The staging database had been refreshed from a sanitised export six weeks earlier. That export had stripped non-ASCII characters from term names as a pre-flight cleanup, the kind of thing a previous developer had set up and nobody had revisited. So the 187 problem terms with curly quotes only ever existed in production. Our test fixture had nothing to detect.

The takeaway is structural. A staging environment that does not contain the production data's mess is staging in name only. We now run a final cutover rehearsal against a fresh, unmassaged production dump less than 48 hours before go-live, and we treat any pre-flight cleanup script as a suspect rather than a helper.

Detection at 23:40

We were lucky the QA lead opened a page that mattered. Without the manual eyeball, the next signal would have been the launch tweet at 09:00 linking to a page that read "Filed under: nothing."

The first thing we did was confirm the scale. One query against the new database gave us a count:

SELECT COUNT(*) AS nodes_missing_topic
FROM node_field_data n
LEFT JOIN node__field_topic t ON t.entity_id = n.nid
WHERE t.entity_id IS NULL
  AND n.type IN ('article','feature','interview');

The number came back as 14,217. We had ten and a half hours.

Diagnosis at 03:00

Two paths were open. Re-run the migration with a corrected pipeline, which meant migrate:rollback of every dependent migration plus another four hours of import time. Or rebuild the references in place from the source data we still had access to.

Re-running was the textbook answer. It was also a bad fit for the clock. The source database was still live at the client's old host, and the migrate_map_publisher_topics table held the source-ID-to-destination-ID mapping for every topic term that had actually migrated. We could reach the source from the new site by configuring a second database connection in settings.local.php:

$databases['legacy']['default'] = [
  'database' => 'publisher_d7',
  'username' => 'readonly',
  'password' => 'redacted',
  'host'     => 'legacy-db.internal',
  'driver'   => 'mysql',
  'port'     => '3306',
];

A read-only user, because we were not going to risk writing to the source under that kind of fatigue. Before the first INSERT ran on production we also dumped the new database to a separate volume with mysqldump. Ninety seconds of disk, total reversibility if anything went sideways.

The rebuild at 04:30

The plan was three statements. Build a working table that mapped every legacy node to its legacy topic strings, exploded one row per topic. Join that to migrate_map_publisher_topics to translate source topic name to destination term ID. Insert the joined rows into node__field_topic and node_revision__field_topic in the new database.

First statement, run against the legacy database:

CREATE TABLE publisher_d7.legacy_topic_link AS
SELECT
  n.nid AS source_nid,
  n.vid AS source_vid,
  TRIM(SUBSTRING_INDEX(
    SUBSTRING_INDEX(t.field_topic_raw_value, '|', numbers.n),
    '|', -1
  )) AS topic_name,
  numbers.n - 1 AS delta
FROM publisher_d7.field_data_field_topic_raw t
JOIN publisher_d7.node n ON n.vid = t.revision_id
JOIN (
  SELECT 1 n UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
  UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8
) numbers
  ON CHAR_LENGTH(t.field_topic_raw_value)
   - CHAR_LENGTH(REPLACE(t.field_topic_raw_value, '|', '')) >= numbers.n - 1
WHERE t.field_topic_raw_value IS NOT NULL
  AND t.field_topic_raw_value != '';

The numbers derived table is the standard MySQL pattern for splitting a delimited string into rows without a recursive CTE, which the legacy MySQL 5.7 host did not support. Eight rows is the cap on how many topics any node had, found by running MAX(CHAR_LENGTH(...) - CHAR_LENGTH(REPLACE(...))) first.

Second statement, run against the new database with the legacy connection aliased:

CREATE TABLE drupal10.rebuild_field_topic AS
SELECT
  CAST(m_node.destid1 AS UNSIGNED) AS entity_id,
  CAST(m_node.destid1 AS UNSIGNED) AS revision_id,
  'en' AS langcode,
  lt.delta AS delta,
  CAST(m_topic.destid1 AS UNSIGNED) AS field_topic_target_id,
  0 AS deleted,
  'article' AS bundle
FROM publisher_d7.legacy_topic_link lt
JOIN drupal10.migrate_map_publisher_nodes m_node
  ON m_node.sourceid1 = lt.source_nid
JOIN drupal10.migrate_map_publisher_topics m_topic
  ON m_topic.sourceid1 = lt.topic_name
WHERE m_node.destid1 IS NOT NULL
  AND m_topic.destid1 IS NOT NULL;

The two NOT NULL guards mattered. About 240 of the legacy topic strings did not match any migrated term, which is the same set of curly-quote names that broke the first import. Those would need a manual pass, but they were not in scope at 04:30.

Third statement, the one that actually wrote to production:

INSERT INTO drupal10.node__field_topic
  (bundle, deleted, entity_id, revision_id, langcode, delta, field_topic_target_id)
SELECT bundle, deleted, entity_id, revision_id, langcode, delta, field_topic_target_id
FROM drupal10.rebuild_field_topic r
WHERE NOT EXISTS (
  SELECT 1 FROM drupal10.node__field_topic existing
  WHERE existing.entity_id = r.entity_id
    AND existing.delta = r.delta
);

INSERT INTO drupal10.node_revision__field_topic
  (bundle, deleted, entity_id, revision_id, langcode, delta, field_topic_target_id)
SELECT bundle, deleted, entity_id, revision_id, langcode, delta, field_topic_target_id
FROM drupal10.rebuild_field_topic r
WHERE NOT EXISTS (
  SELECT 1 FROM drupal10.node_revision__field_topic existing
  WHERE existing.entity_id = r.entity_id
    AND existing.delta = r.delta
);

Both tables, because Drupal stores entity_reference fields in two parallel tables, one keyed by current revision and one by every revision. Skip the second one and the references show on the public page but vanish in the editor form, which we discovered the hard way on a different migration in 2021.

After the inserts ran, we rebuilt the entity cache and the search index:

drush cache:rebuild
drush search-api:reset-tracker default_index
drush search-api:index --batch-size=200 default_index

The Search API step mattered because the client's site used a topic facet driven off the reference field. Without re-indexing, the facet would have looked empty even though the page content was correct.

At 06:48 we ran the count query again. Nodes with no topic where the source had a topic: zero. Nodes with no topic where the source also had no topic: 213, the expected real-world empties. We slept for an hour and watched the launch tweet go out on schedule.

Takeaway

migrate:import reports row success, not field success. A green migration is necessary but not sufficient. Validate at the field level before you call a site live.

The 240 terms we left for Monday

The third SQL statement had two NOT NULL guards that quietly skipped 240 legacy topic strings that did not match any migrated term. Those were the curly-quote casualties from the original failed import. On Monday morning we cleaned them up in three steps. A short script normalised the names in the legacy source data, replacing typographic quotes with straight ones. The topics migration was re-run with drush migrate:import publisher_topics --update so the freshly cleaned terms got mapped through. Then we re-ran the second statement from the night before, so the working table picked up the newly mapped term IDs, and did a targeted INSERT for just those nodes.

Twelve nodes had been miscategorised under a near-duplicate term in the meantime, because two source topics had collided to the same straight-quote name after normalisation. The editor caught it in the daily content review and we patched those by hand. A reasonable cost for an otherwise on-time launch.

What we ship now to catch this earlier

Two things have changed in our migrate playbook since this incident.

First, every entity_reference field gets a post-migrate validator that compares legacy row counts to new-site row counts at the field level, not the row level. It runs as a Drush command in the same CI pipeline that runs the migration. A non-zero delta fails the build. Drupal's documentation describes migration_lookup in detail but does not warn about the row-count illusion. The Drupal community has a long-standing issue thread on silent skip behaviour in process plugins that is worth reading before any non-trivial migration.

The validator reads a YAML manifest of field-to-source mappings and runs a count comparison for each entry:

- field: field_topic
  bundle: article
  legacy:
    connection: legacy
    query: >
      SELECT COUNT(*)
      FROM field_data_field_topic_raw
      WHERE entity_type = 'node' AND deleted = 0
  new:
    table: node__field_topic
    where: bundle = 'article'
  tolerance: 0

The Drush command runs each legacy query through the same secondary connection we set up that night, counts rows in the new table with the matching WHERE clause, and exits non-zero on a delta larger than the per-field tolerance. CI runs it as the final step of the dry-run job. A future migration with the same bug pattern fails the build before anyone goes to bed.

Second, the skip_on_empty plugin in the failing pipeline was replaced with a custom process plugin that logs the source ID and field name when a lookup returns NULL. Fifteen lines of PHP. It would have surfaced 14,000 warnings during the dry run.

A useful sanity check we now run before every cutover, borrowed from the audit pass in the Drupal core upgrade docs: pick the ten highest-traffic URLs from the legacy site's analytics, request them on the new site, and diff the rendered HTML word count. A drop of more than five percent on any page is treated as a failure until proven harmless.

Three habits from our existing playbook held up under stress and we kept all of them. The legacy database credentials we used at 03:00 were read-only by default, not because we anticipated this exact scenario but because writeable credentials never get used for inspection work, and any habit you want available at 04:30 has to be the default at 14:00. The mysqldump before the first INSERT cost ninety seconds and gave us a full rollback point on production. And the two-rehearsal rule on staging, while it had not surfaced the bug, had at least proven that the migrate pipeline itself was stable, which let us trust the rebuild path instead of second-guessing it.

The five-minute audit you could run today

If you have a Drupal migration that finished cleanly in the past month and you are not sure whether it told you the whole truth, here is the smallest check that catches this class of bug. For each entity_reference field on a content type that mattered, run:

SELECT
  (SELECT COUNT(*) FROM legacy.field_data_field_topic) AS legacy_rows,
  (SELECT COUNT(*) FROM new.node__field_topic)         AS new_rows;

If the new number is meaningfully smaller and you cannot explain the gap with unpublished or deleted source content, you have the same leak we did. Catch it before the launch tweet does.

When we rebuilt the publisher's reference data that night, the thing that saved us was that the source database was still live and read-accessible. On a more recent legacy migration for a Dutch e-commerce client, we now keep the source database snapshotted on the new host for thirty days after cutover as standard. The cost is a few gigabytes of disk. The benefit is that the next 04:30 SQL session, if it happens, starts with the data already to hand.

Key takeaway

A green Drupal migration report can still hide thousands of empty reference fields, because migrate:import counts rows, not fields.

FAQ

Why did Drupal's migrate command not flag the dropped references as failures?

Because migrate:import reports row-level outcomes. A node row can be Created with an empty reference field if a process plugin like skip_on_empty (method: process) clears the field on a NULL lookup.

Is writing directly into Drupal's field tables safe?

It is safe if you dump first, write to both the current and revision tables for the field, and run drush cache:rebuild plus a Search API reindex. Skip any of those and you will see ghost data.

Could you have re-run the migration instead of writing SQL?

Yes, but only at the cost of a full rollback plus four hours of import time. With a 09:00 launch and a live source database, a targeted SQL rebuild was the faster and safer path.

How do you prevent silent field drops on the next migration?

Replace skip_on_empty with a custom process plugin that logs every NULL lookup, and add a CI check that compares legacy row counts to new row counts at the field level, not the row level.

drupalmigrationmysqllegacy sitesoperationscase study

Building something?

Start a project