Migration

MySQL to Postgres traps: the cheatsheet we use on migrations

Every legacy migration starts the same way. The dump loads, the app boots, then something subtle breaks at 11pm on a Tuesday and you spend two hours finding it.

Jacob Molkenboer· Founder · A Brand New Company· 28 Jun 2024· 7 min

Open leather logbook on ivory paper desk, brass key on tag with twine, green ribbon, red wax seal fragment.

The MySQL dump replays into Postgres. The app boots. The smoke test passes. Then at 11pm on a Tuesday the support inbox lights up because the search bar returns nothing for any name with a 'ç' or an 'ñ' in it. You spend two hours finding the cause, and the migration finished four days ago.

We have shipped this migration enough times now that we keep a private cheatsheet. The list below is what actually trips us up on a MySQL to Postgres move. None of it is exotic. All of it has cost us at least one evening.

ENUMs that will not survive the dump

MySQL lets you declare an ENUM inline in a column definition. Postgres requires the type to exist first, as its own object. pg_dump and pgloader both handle this, but the moment you have a hand-rolled migration script the inline definition silently becomes a varchar with no check constraint, and a year later someone writes 'actve' instead of 'active' into the table.

The fix is to make the type explicit before the table:

CREATE TYPE order_status AS ENUM ('pending', 'paid', 'shipped', 'refunded');

CREATE TABLE orders (
  id          bigserial PRIMARY KEY,
  status      order_status NOT NULL DEFAULT 'pending',
  created_at  timestamptz  NOT NULL DEFAULT now()
);

Adding a value later is a one-line statement (ALTER TYPE order_status ADD VALUE 'cancelled'), but removing or reordering values is genuinely painful. If the column is going to grow new states regularly, skip ENUM entirely and use a lookup table with a foreign key. The Postgres docs are honest about the trade-off.

Zero dates and the silent NULL

MySQL accepts '0000-00-00' as a date by default. Postgres does not, and never has. If your source schema was built before NO_ZERO_DATE became standard, there are rows with 0000-00-00 00:00:00 sitting in created_at columns, and the migration will fail on the first one.

The quick fix is one update on the MySQL side before the dump:

UPDATE users
SET created_at = NULL
WHERE created_at = '0000-00-00 00:00:00';

The longer fix is to ask what those rows actually represent. We have seen 'unknown signup date', 'soft-deleted user', and 'the dev's test row from 2014' all hiding behind the same zero. Treat the zero as a question, not as data.

Warning

Some application code reads 0000-00-00 as a sentinel and branches on it. Grep the codebase before you NULL them out, or the new Postgres app will hit a fresh null-pointer error at the same business logic.

The collation that breaks LIKE

This is the one nobody mentions in the migration guides. MySQL's default collation for utf8mb4 is case-insensitive and accent-insensitive (usually utf8mb4_0900_ai_ci). Postgres defaults to a deterministic, case-sensitive, locale-aware collation. The behaviour of LIKE 'jose%' changes overnight.

'José' no longer matches 'jose'. 'MÜLLER' no longer matches 'muller'. The app code does not throw any errors. The search bar just quietly returns less.

You have three options, in order of how much we like them:

Use ILIKE for case-insensitive matches and accept that you still need to handle accents separately.
Use unaccent on both sides: WHERE unaccent(name) ILIKE unaccent('jose%'). Add a functional index on unaccent(name) or this gets slow at scale.
Create a column with a nondeterministic ICU collation and use it for searchable text. The Postgres collation docs walk through the syntax, and they are clearer than they used to be.

If you skip this step, expect a support ticket from the first user whose name contains a diaeresis.

tinyint(1) and the boolean illusion

MySQL stores booleans as tinyint(1). Most ORMs paper over this. pgloader tries to be helpful and converts tinyint(1) to boolean, which is usually what you want, except when you used tinyint(1) to store a small integer (0 through 9) and now it is a boolean that has lost all values above 1.

Audit every tinyint(1) column before the dump runs. If it actually stores 0 or 1, let the conversion happen. If it stores a small range of integers, change the MySQL column to tinyint with no width hint first, and the conversion will keep it as a small integer on the Postgres side.

AUTO_INCREMENT and the off-by-one

This one bites everyone exactly once. MySQL's AUTO_INCREMENT lives on the table. Postgres uses sequences, which are separate objects. pg_dump handles the conversion when you go Postgres-to-Postgres, but on a MySQL source the sequence value is your responsibility. Migrate the data with a custom script, insert a row by hand to test, and the sequence is still at 1. Your manual insert works. The first real user signup then explodes with a duplicate key.

After any data migration, reset every sequence to the max id in its table:

SELECT setval(
  pg_get_serial_sequence('users', 'id'),
  COALESCE(MAX(id), 1)
) FROM users;

Script this for every table. We keep it as a post-migration step in the runbook and run it twice, once after the bulk load and once after any reconciliation pass.

ON UPDATE CURRENT_TIMESTAMP

MySQL has a column attribute called ON UPDATE CURRENT_TIMESTAMP. Set it on updated_at and the column updates itself on every row write. Postgres does not have this. It has triggers.

You need a function and a trigger per table:

CREATE OR REPLACE FUNCTION touch_updated_at()
RETURNS TRIGGER AS $$
BEGIN
  NEW.updated_at = now();
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_touch_updated_at
BEFORE UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION touch_updated_at();

Or you set updated_at = now() in the application's update statement and stop relying on the database. We default to the trigger because the application layer always forgets eventually, especially when someone writes a raw SQL script for a one-off cleanup.

Case, quotes, and the silent rename

MySQL on Linux is case-sensitive about table names. MySQL on macOS is not. Postgres lowercases every unquoted identifier, so Users and users are the same table, unless you quote them, in which case they are not.

If your MySQL schema has tables like User and UserSession in CamelCase, you have one decision to make at migration time. Either rename everything to snake_case and update the application, or wrap every reference in double quotes forever. We always rename. The first time you forget a quote in a hand-typed query, you will agree.

GROUP BY and the strictness shift

Older MySQL would happily run SELECT id, name, count(*) FROM orders GROUP BY id and return an arbitrary name. Postgres rejects it. Every aggregate query has to list non-aggregated columns in the GROUP BY, or wrap them in an aggregate function.

If the source app was written before ONLY_FULL_GROUP_BY got switched on, the migration will surface a long tail of queries that 'worked' but were undefined. Grep the codebase for GROUP BY before cutover and fix them while you have time, not while users are watching.

The five-minute pre-flight audit

Before you start the actual cutover, run these queries against the MySQL source. They take a minute each and they tell you what is going to break:

-- Inline ENUMs you will need to re-declare as types
SELECT table_name, column_name, column_type
FROM information_schema.columns
WHERE data_type = 'enum'
  AND table_schema = DATABASE();

-- Date columns that might hold zero dates
SELECT table_name, column_name
FROM information_schema.columns
WHERE data_type IN ('date','datetime','timestamp')
  AND table_schema = DATABASE();

-- tinyint(1) columns to triage as bool vs small int
SELECT table_name, column_name
FROM information_schema.columns
WHERE column_type = 'tinyint(1)'
  AND table_schema = DATABASE();

-- CamelCase table names that will need renaming
SELECT table_name
FROM information_schema.tables
WHERE table_name REGEXP '[A-Z]'
  AND table_schema = DATABASE();

-- Largest tables, so you know what will dominate the dump window
SELECT table_name, table_rows
FROM information_schema.tables
WHERE table_schema = DATABASE()
ORDER BY table_rows DESC
LIMIT 10;

Print the output. Tape it to the wall next to the runbook. Every row is a decision someone needs to make before the cutover window opens.

Takeaway

The trap is rarely the data type itself. It is the silent default the old database picked years ago that nobody ever wrote down.

What we do at the start of every migration

When we ran the most recent legacy migration for a Dutch SaaS company moving off a 2014-era PHP and MySQL stack, the thing that ate the most hours was the collation problem. Search worked in staging because staging had US-English names. The first week in production, the support inbox filled up with German and Turkish users whose names did not match themselves. We solved it by switching the searchable columns to a nondeterministic ICU collation and adding the functional indexes after.

The five-minute audit above is the smallest useful thing you can do today. Run it against any MySQL database you are even thinking about migrating, and you will know within the hour whether the cutover takes a weekend or a month.

Key takeaway

The trap is rarely the data type itself. It is the silent default the old database picked years ago that nobody ever wrote down.

FAQ

Should I use pgloader or pg_dump / restore?

Use pgloader for the initial MySQL-to-Postgres pass. It handles type conversions automatically. Use pg_dump for the final reconciliation once you are working Postgres to Postgres.

What about JSON columns?

MySQL JSON and Postgres jsonb differ on key ordering and whitespace. Validate the round-trip on a sample row before bulk-loading, especially if any code compares serialized JSON as strings.

How long should a real cutover take?

Plan for a one-hour read-only window for databases under 100GB if replication is hot. Larger or write-heavy databases need logical replication so cutover is closer to five minutes.

Do I need to change my ORM?

Usually not in the queries it generates. Almost always yes in the migration files. Hand-written migrations assume MySQL syntax and need a second pass before they run on Postgres.

Can I skip the collation fix if my users are all English-speaking?

You can, until you cannot. Names with apostrophes, hyphens, or any non-ASCII character will eventually show up. Fix the collation once, not as a panic patch later.

migrationlegacy sitesmysqlphparchitectureoperations

Building something?

Start a project