Drupal 7 to Astro: a migration playbook for press URLs

A Volkskrant editor's auto-link tool pulls a 2019 URL from her archive. Three hours later it returns a 404. Here is how to migrate Drupal 7 without losing the press.

Jacob Molkenboer · Founder · A Brand New Company · 6 Jul 2024 · 9 min

A Volkskrant editor's CMS pings on a Tuesday morning. The auto-link tool wants the canonical URL for a story written five years ago about a minister signing an energy covenant. It pulls /nieuws/2019/03/14/minister-tekent-energieakkoord from its archive, drops it into the new article, and publishes. Three hours later, that link returns a 404. The ministry's Drupal 7 site went dark over the weekend, and the new Astro build forgot one of the four URL shapes the Dutch press has been deep-linking to since 2017.

That is the failure mode we spend the most time preventing on government migrations. The press does not read your release notes. Their CMS templates encode your URL structure, and once a journalist's auto-link tool resolves to a 404, they do not email you. They just stop linking.

Here is the playbook we use when a 9-year-old Drupal 7 site has to land on Astro plus a thin headless CMS without breaking that contract.

The four URL shapes that must survive

On a typical ministry or agency Drupal 7 site, four URL patterns carry almost all the inbound press value:

/nieuws/YYYY/MM/DD/slug. News items, deep-linked by NRC, Volkskrant and Trouw archives.
/dossier/slug. Long-running topic dossiers, often linked from Wikipedia and from the Tweede Kamer's own portal.
/persbericht/YYYY/NN. Numbered press releases, picked up by ANP and Bloomberg syndication.
/node/[nid]. The raw Drupal node path. Old links to PDF attachments and forgotten subpages still point here.

The fourth one is the trap. Most teams treat /node/[nid] as legacy noise and drop it. Then somebody finds out that an official transcript links to /node/4421 because a clerk pasted the URL straight from the address bar in 2018.

You inventory all four before you write a single line of Astro. The first SQL pass looks like this:

SELECT
  n.nid,
  n.type,
  n.title,
  n.created,
  ua.alias,
  CONCAT('/node/', n.nid) AS canonical_node
FROM node n
LEFT JOIN url_alias ua ON ua.source = CONCAT('node/', n.nid)
WHERE n.status = 1
ORDER BY n.created DESC;

Run that, dump to CSV, sort by URL shape. You will find aliases that do not match any of the four documented patterns. Those are the items somebody hand-edited in 2014 and forgot. They count too.
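A minimal sketch of that sort step, assuming the alias column has been read from the CSV into plain strings and that aliases arrive without a leading slash, as Drupal 7 stores them. The classifyAlias and countByShape helpers and the shape names are ours for illustration, not part of any tool.

```typescript
// Hypothetical helper: bucket a Drupal 7 alias into one of the four press URL shapes.
type UrlShape = 'nieuws' | 'dossier' | 'persbericht' | 'node' | 'other';

const SHAPES: ReadonlyArray<[Exclude<UrlShape, 'other'>, RegExp]> = [
  ['nieuws', /^\/nieuws\/\d{4}\/\d{2}\/\d{2}\/[^/]+$/],
  ['dossier', /^\/dossier\/[^/]+$/],
  ['persbericht', /^\/persbericht\/\d{4}\/\d+$/],
  ['node', /^\/node\/\d+$/],
];

function classifyAlias(alias: string): UrlShape {
  // Normalise to a leading slash before matching.
  const path = alias.startsWith('/') ? alias : `/${alias}`;
  for (const [shape, pattern] of SHAPES) {
    if (pattern.test(path)) return shape;
  }
  // Anything else is a hand-edited alias that still needs a decision.
  return 'other';
}

function countByShape(aliases: string[]): Record<UrlShape, number> {
  const counts: Record<UrlShape, number> = {
    nieuws: 0, dossier: 0, persbericht: 0, node: 0, other: 0,
  };
  for (const alias of aliases) counts[classifyAlias(alias)] += 1;
  return counts;
}
```

The "other" bucket is the one to read line by line: it holds the aliases that match none of the documented patterns.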

Why Astro plus a thin CMS, not the other way around

The reflex on a government migration is to pick a flagship headless CMS first and let the frontend follow. We do it the other way. Astro decides what content shapes are even renderable, then we pick the smallest CMS that fits.

Astro's reason for being on a project like this is that 90% of a ministry's pages are static news items, dossiers, and forms. They change weekly, not per request. Astro's hybrid output lets us prerender every news item at build time and keep a handful of dynamic routes (search, contact form, the occasional logged-in editor preview) on a small Node server.

Build time matters. A Drupal 7 site with 8,000 news items and 400 dossiers builds in Astro in around four minutes on a 4-vCPU runner. That means an editor can publish, see staging in five minutes, and roll forward. No PHP-FPM, no Varnish layer to invalidate, no module update that takes the site down for an afternoon.

The thin CMS sits behind Astro and exposes a typed content API. We default to Directus or Payload depending on the team. Directus when the editorial team wants a Drupal-like admin with collections and roles they recognise. Payload when there is an in-house developer who will live in the schema.

The point of "thin" is that we do not ask the CMS to render anything. No Twig, no Liquid, no theme layer. It stores content, it speaks JSON, it gets out of the way.

Crawling the Drupal 7 database before touching code

Before we provision anything, we mirror production to a hardened sandbox and crawl it. Three queries do most of the work:

-- Every published node with its body and aliased URL
SELECT
  n.nid, n.type, n.title, n.created, n.changed,
  b.body_value AS body,
  ua.alias
FROM node n
LEFT JOIN field_data_body b
  ON b.entity_id = n.nid
  AND b.entity_type = 'node'
  AND b.deleted = 0
LEFT JOIN url_alias ua ON ua.source = CONCAT('node/', n.nid)
WHERE n.status = 1;

-- Every file attachment referenced from a node
SELECT
  fm.fid, fm.filename, fm.uri, fm.filemime, fu.id AS nid
FROM file_managed fm
JOIN file_usage fu ON fu.fid = fm.fid
WHERE fu.type = 'node';

-- Every redirect already configured in the Redirect module
SELECT source, redirect, status_code
FROM redirect;

The third one is what most teams miss. A 9-year-old Drupal site has hundreds of internal redirects that the editorial team added over the years to fix typos, to re-route after reorganisations, to stitch in old microsite URLs. Those redirects are part of the contract with the press, even if nobody documented them. They have to come across.

We pipe the three result sets through a small Node script that produces three artefacts: a JSON content tree per node type, a manifest of files to move to object storage, and a flat redirect map keyed by the original URL.
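The redirect-map part of that script can be sketched as follows. It is an illustration under stated assumptions, not our production script: the row shapes, the newPath field (the route the node will have in Astro) and the buildRedirectMap helper are hypothetical. It also assumes the Redirect module stores source and redirect without a leading slash and uses status_code 0 to mean "site default", which we treat as 301.

```typescript
interface NodeRow { nid: number; alias: string | null; newPath: string }
interface RedirectRow { source: string; redirect: string; status_code: number }
interface RedirectEntry { from: string; to: string; status: number }

const withSlash = (p: string): string => (p.startsWith('/') ? p : `/${p}`);
// External targets are kept as-is; internal ones get a leading slash.
const toTarget = (p: string): string => (/^https?:\/\//.test(p) ? p : withSlash(p));

function buildRedirectMap(nodes: NodeRow[], redirects: RedirectRow[]): RedirectEntry[] {
  const byFrom = new Map<string, RedirectEntry>();
  const newPathByNid = new Map<number, string>();

  const add = (from: string, to: string, status = 301): void => {
    // Skip self-redirects; the first rule for a given source wins.
    if (from !== to && !byFrom.has(from)) byFrom.set(from, { from, to, status });
  };

  for (const n of nodes) {
    newPathByNid.set(n.nid, n.newPath);
    add(`/node/${n.nid}`, n.newPath);               // raw node path
    if (n.alias) add(withSlash(n.alias), n.newPath); // old alias, if it moved
  }

  for (const r of redirects) {
    // Resolve "node/123" targets straight to the new path to avoid a second hop.
    const m = /^node\/(\d+)$/.exec(r.redirect);
    const resolved = m ? newPathByNid.get(Number(m[1])) : undefined;
    add(withSlash(r.source), resolved ?? toTarget(r.redirect), r.status_code || 301);
  }

  return Array.from(byFrom.values());
}
```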

Content modeling against historic URLs

Now we model the CMS. The trick is that the schema is shaped by what the URLs already are, not by what we wish they were.

If /nieuws/YYYY/MM/DD/slug is the canonical, then a news item has a published_at datetime that drives the path, plus the slug. We do not give the editor the freedom to pick an arbitrary URL on a news item. That freedom is what broke the press archive on the previous redesign in 2019, when an intern moved the news section to /actueel.

In Directus this looks like:

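// directus/extensions/hooks/news-slug/index.ts
import { defineHook } from '@directus/extensions-sdk';

export default defineHook(({ filter }) => {
  // Filter hooks run before the write; the returned payload is what gets stored.
  filter('news_item.items.create', async (payload) => {
    const date = new Date(payload.published_at);
    // A missing or malformed date would otherwise store /nieuws/NaN/NaN/NaN/...
    if (!payload.published_at || Number.isNaN(date.getTime())) {
      throw new Error('news_item needs a valid published_at to compute its path');
    }
    // These getters read the server's local timezone, not UTC.
    const y = date.getFullYear();
    const m = String(date.getMonth() + 1).padStart(2, '0');
    const d = String(date.getDate()).padStart(2, '0');
    payload.path = `/nieuws/${y}/${m}/${d}/${payload.slug}`;
    return payload;
  });
});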

The path is computed and stored. The editor sees it but cannot edit it. Astro reads it verbatim and renders the page at that exact route.
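One caveat with the hook above: it derives year, month and day with local-time getters, so the stored path depends on the server's timezone setting. A minimal sketch of a timezone-explicit variant, assuming publication dates should be read in Europe/Amsterdam. The newsPath helper is hypothetical and not part of Directus.

```typescript
// Hypothetical helper: compute the /nieuws/YYYY/MM/DD/slug path in a fixed timezone.
function newsPath(
  publishedAt: string | Date,
  slug: string,
  timeZone = 'Europe/Amsterdam', // assumption: dates follow Dutch local time
): string {
  const date = new Date(publishedAt);
  if (Number.isNaN(date.getTime())) {
    throw new Error(`Invalid published_at: ${String(publishedAt)}`);
  }

  // formatToParts returns the calendar fields as seen in the given timezone.
  const parts = new Intl.DateTimeFormat('en-US', {
    timeZone,
    year: 'numeric',
    month: '2-digit',
    day: '2-digit',
  }).formatToParts(date);

  const get = (type: string): string =>
    parts.find(p => p.type === type)?.value ?? '';

  return `/nieuws/${get('year')}/${get('month')}/${get('day')}/${slug}`;
}
```

A story published at 23:30 UTC on 13 March 2019 is already 14 March in the Netherlands, so it lands under /nieuws/2019/03/14/ no matter where the server runs.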

For /node/[nid], we keep a column legacy_nid on every imported collection. There is no editor surface for it. It exists so the redirect map can resolve.

The redirect map as the source of truth

Every URL the Drupal site ever served goes into one file. We call it redirects.json and it lives in the Astro repo, version-controlled, reviewed in PR.

[
  { "from": "/node/4421", "to": "/nieuws/2018/06/12/tk-debat-energie", "status": 301 },
  { "from": "/oud-dossier/klimaat", "to": "/dossier/klimaat", "status": 301 },
  { "from": "/persbericht/2017/89", "to": "/persberichten/2017/89-akkoord-getekend", "status": 301 }
]

In Astro we wire it through a middleware that fires before any route matches:

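// src/middleware.ts
import { defineMiddleware } from 'astro:middleware';
import redirects from '../data/redirects.json';

// Build the lookup once at module load, keyed by the legacy path.
const map = new Map(redirects.map(r => [r.from, r]));

export const onRequest = defineMiddleware(async (ctx, next) => {
  // Exact match on the path only; the query string is not part of the key.
  const hit = map.get(ctx.url.pathname);
  if (hit) {
    return ctx.redirect(hit.to, hit.status);
  }
  // No legacy match: continue to the normal Astro route.
  return next();
});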

On the hosting side (Vercel, Netlify, or self-hosted with Caddy) we mirror the same file into platform-level redirects so they fire at the edge without invoking the Astro runtime. The middleware is a safety net for paths the build forgot.
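As one illustration of that mirroring, here is a sketch that turns the redirect entries into Netlify's plain-text _redirects format, where each line holds a source path, a destination and a status code. Vercel and Caddy need their own formats. The toNetlifyRedirects helper is hypothetical.

```typescript
interface RedirectEntry { from: string; to: string; status: number }

// Hypothetical helper: serialise redirect entries as "from to status" lines.
function toNetlifyRedirects(entries: RedirectEntry[]): string {
  const lines = entries.map(({ from, to, status }) => {
    // Whitespace would split a rule into the wrong columns, so fail loudly.
    if (/\s/.test(from) || /\s/.test(to)) {
      throw new Error(`Unencoded whitespace in rule for ${from}`);
    }
    return `${from} ${to} ${status}`;
  });
  return lines.join('\n') + '\n';
}
```

Generating the file from redirects.json during the build keeps the edge rules and the middleware reading from the same source.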

Warning

If you generate the redirect map from the database once and forget, you will miss the 200 to 400 aliases that editors added in the last six weeks before cutover. Re-run the crawl on the day of go-live, regenerate, redeploy. We have lost a press archive by skipping this step once.
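A small diff makes the go-live regeneration reviewable. The sketch below compares the committed map with a freshly generated one, assuming both use the from/to/status shape of redirects.json. The diffRedirects helper is hypothetical.

```typescript
interface RedirectEntry { from: string; to: string; status: number }

// Hypothetical helper: report what changed between two versions of the redirect map.
function diffRedirects(committed: RedirectEntry[], fresh: RedirectEntry[]) {
  const before = new Map(committed.map(e => [e.from, e] as const));
  const after = new Map(fresh.map(e => [e.from, e] as const));

  return {
    // Sources that only exist in the fresh crawl (late editor additions).
    added: fresh.filter(e => !before.has(e.from)),
    // Sources that disappeared since the committed version.
    removed: committed.filter(e => !after.has(e.from)),
    // Same source, different destination or status.
    changed: fresh.filter(e => {
      const old = before.get(e.from);
      return old !== undefined && (old.to !== e.to || old.status !== e.status);
    }),
  };
}
```

An "added" list that is empty on cutover day is worth a second look: it may mean the crawl ran against a stale mirror.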

Verifying the press archive before cutover

We do not trust our own redirect map. Three weeks before cutover we assemble a list of known inbound URLs from the Dutch press archive: NRC, Volkskrant, Trouw, ANP, RTL, NOS. We pull the last five years of links pointing to the ministry's domain from publicly available search APIs and from the client's own server logs.

That list, typically 4,000 to 9,000 URLs, becomes the test suite.

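#!/usr/bin/env bash
# verify-press-links.sh
# Reads one path per line from press-links.txt and writes a TSV report:
# final status, redirect hops, requested path, final URL.
while IFS= read -r url; do
  # One request per URL: -L follows redirects, -w prints the outcome.
  result=$(curl -s -o /dev/null -L \
    -w '%{http_code}\t%{num_redirects}\t%{url_effective}' \
    "https://staging.example.nl${url}")
  IFS=$'\t' read -r code hops final <<< "$result"
  printf "%s\t%s\t%s\t%s\n" "$code" "$hops" "$url" "$final"
done < press-links.txt | tee press-link-report.tsv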

We grep for anything that is not a 200 after redirect, and for any 301 chain longer than one hop. Every extra hop adds a round trip before the first byte, which drags down Core Web Vitals, and long chains confuse the Google News indexer that ministries care about. The rule we ship is: one redirect, then a 200. No exceptions.
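The one-hop rule can also be enforced on the map itself before anything is deployed. A minimal sketch, assuming entries in the from/to/status shape of redirects.json. The flattenRedirects helper is hypothetical: it points every source at its final destination and refuses maps that contain a loop.

```typescript
interface RedirectEntry { from: string; to: string; status: number }

// Hypothetical helper: collapse A -> B -> C into A -> C and B -> C.
function flattenRedirects(entries: RedirectEntry[]): RedirectEntry[] {
  const byFrom = new Map(entries.map(e => [e.from, e] as const));

  return entries.map(entry => {
    const seen = new Set<string>([entry.from]);
    let to = entry.to;

    // Follow the chain while the destination is itself a redirect source.
    while (byFrom.has(to)) {
      if (seen.has(to)) throw new Error(`Redirect loop involving ${to}`);
      seen.add(to);
      to = byFrom.get(to)!.to;
    }

    return { ...entry, to };
  });
}
```

Running this in CI on every change to redirects.json means a chain is caught in review rather than in the press-link report.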

This is also a values question. Tim Berners-Lee laid down the rule in 1998 in "Cool URIs don't change", and a quarter-century later it is still the cleanest argument for treating the redirect map as a public commitment, not a developer chore.

Cutover without breaking the press archive

Cutover happens on a Friday night, between 22:00 and 02:00 CET. The order:

1. Freeze editorial in Drupal at 21:00. Editors get an email.
2. Final crawl of url_alias, redirect and node tables. Regenerate redirects.json. Push to main.
3. Trigger production build of Astro. Watch the build run.
4. Swap DNS at the CDN level. TTL has been at 300 seconds for 48 hours.
5. Re-run press-link verification against production. 4,000 to 9,000 curls. Takes about 12 minutes.
6. If any URL returns non-200, roll DNS back. We have rolled back twice. Both times it was a single dossier whose alias had a trailing-slash mismatch.
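The trailing-slash mismatch behind both rollbacks can be guarded against by normalising paths on both sides of the redirect lookup. A sketch, assuming the canonical form has no trailing slash. The normalisePath, indexRedirects and lookupRedirect helpers are hypothetical.

```typescript
interface RedirectEntry { from: string; to: string; status: number }

// Collapse duplicate slashes, then drop a trailing slash (except on the root).
function normalisePath(pathname: string): string {
  const collapsed = pathname.replace(/\/{2,}/g, '/');
  return collapsed.length > 1 && collapsed.endsWith('/')
    ? collapsed.slice(0, -1)
    : collapsed;
}

// Key the map by the normalised source so "/x" and "/x/" share one rule.
function indexRedirects(entries: RedirectEntry[]): Map<string, RedirectEntry> {
  const map = new Map<string, RedirectEntry>();
  for (const e of entries) map.set(normalisePath(e.from), e);
  return map;
}

function lookupRedirect(
  map: Map<string, RedirectEntry>,
  pathname: string,
): RedirectEntry | undefined {
  return map.get(normalisePath(pathname));
}
```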

The original Drupal stays warm and read-only for two weeks. We keep a reverse proxy rule that routes old-cms.example.nl to the old box, accessible only from the editorial team's IP range. That gives the comms department a fallback they can screenshot if something looks wrong.

What we measured the morning after

The morning-after dashboard is three numbers.

First, the 301 hit rate on the redirect map. Edge logs filtered to status 301, grouped by source URL. If a path in redirects.json is getting zero hits after a week, it was probably never linked. If a path not in redirects.json is getting hits and returning 404, it is a miss we need to add.
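That first number can be computed with a short pass over the edge logs. A sketch, assuming the logs have already been parsed into path and status pairs. The EdgeLogLine shape and the auditRedirects helper are hypothetical and do not reflect any particular CDN's log format.

```typescript
interface RedirectEntry { from: string; to: string; status: number }
interface EdgeLogLine { path: string; status: number }

// Hypothetical helper: find redirect rules nobody hits and 404s missing from the map.
function auditRedirects(entries: RedirectEntry[], log: EdgeLogLine[]) {
  const known = new Set(entries.map(e => e.from));
  const hits = new Map<string, number>();   // 301 hits per redirect source
  const misses = new Map<string, number>(); // 404s on paths not in the map

  for (const { path, status } of log) {
    if (status === 301 && known.has(path)) {
      hits.set(path, (hits.get(path) ?? 0) + 1);
    }
    if (status === 404 && !known.has(path)) {
      misses.set(path, (misses.get(path) ?? 0) + 1);
    }
  }

  return {
    // Rules with zero hits in the log window: probably never linked.
    unused: entries.filter(e => !hits.has(e.from)).map(e => e.from),
    // Misses to add to redirects.json, most requested first.
    missing: Array.from(misses.entries())
      .sort((a, b) => b[1] - a[1])
      .map(([path, count]) => ({ path, count })),
  };
}
```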

Second, time-to-first-byte for the top 50 news items by historic traffic. Astro static output should sit under 80ms from a Dutch CDN edge. Drupal 7 with Varnish on a good day did 180ms. On a bad day, 1.2 seconds.

Third, Google Search Console coverage. Indexed pages should match within 5% of the pre-migration count by day 14. If it diverges, the canonical tags or the sitemap are off.

We do not celebrate until day 30. That is when the press archive cycles long enough for a missed URL pattern to become loud. If day 30 is quiet, the migration held.

The smallest thing you can do today

Open your current site's Redirect module table (or Apache rewrite config, or nginx conf), export every rule to a CSV, and grep your last 90 days of access logs against it. Anything in the logs but not in the CSV is a URL you do not know you are serving. That list, however short, is the first page of your migration playbook.
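The comparison can be sketched in a few lines, assuming access logs in the common or combined log format and a list of known paths taken from the exported CSV. The servedButUnknown helper is hypothetical, and the regular expression only covers GET and HEAD requests.

```typescript
// Matches the request and status fields of a common/combined log line, e.g.
// 1.2.3.4 - - [06/Jul/2024:10:00:00 +0200] "GET /node/4421?x=1 HTTP/1.1" 200 512
const REQUEST = /"(?:GET|HEAD) (\S+) HTTP\/[\d.]+" (\d{3})/;

// Hypothetical helper: paths that were served but do not appear in the known list.
function servedButUnknown(logLines: string[], knownPaths: Iterable<string>): string[] {
  const known = new Set(knownPaths);
  const unknown = new Set<string>();

  for (const line of logLines) {
    const m = REQUEST.exec(line);
    if (!m) continue; // skip lines that are not GET/HEAD requests

    const [, target = '', code = ''] = m;
    const q = target.indexOf('?');
    const path = q === -1 ? target : target.slice(0, q); // drop the query string

    // Status below 400 means the site answered with content or a redirect.
    if (Number(code) < 400 && !known.has(path)) unknown.add(path);
  }

  return Array.from(unknown).sort();
}
```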

When we moved a national agency off Drupal 7 onto Astro and Directus last winter, the thing we underestimated was how much of the redirect map lived in editors' heads and not in the database. We ended up running two weeks of editor interviews before the crawl made sense. Our legacy migration practice grew out of that mistake.

Key takeaway

Your press archive lives in URL patterns the editors do not remember setting. Treat the redirect map as the public contract, not the afterthought.

FAQ

Why Astro and not Next.js for a government migration?

Astro prerenders the 90% of pages, news items and dossiers, that change weekly, not per request, and its islands architecture ships JavaScript only for the components that need it. That matches government content patterns and keeps build and TTFB cheap.

Which CMS do you pair with Astro on these migrations?

Directus when editors want a Drupal-like admin with collections and roles, Payload when there is an in-house developer who will live in the schema. Both expose typed JSON only, no theme layer.

How long does a Drupal 7 to Astro migration take?

For a 5,000 to 10,000 node site with the press URL contract, six to ten weeks. Two of those are editor interviews and database crawls. The rest is modeling, build, and verification.

What breaks most often on cutover?

Trailing-slash mismatches on aliased URLs and the editor-added redirects from the last six weeks before freeze. Re-crawl the redirect table on the day of go-live and redeploy before the DNS swap.

drupal · migration · legacy sites · php · architecture · case study
