The Migration That Works in Dev and Locks Production
Schema changes look safe in tiny local databases. Production is where table size, lock time, and rollout order turn an ordinary migration into a real incident.
Why database migrations that feel harmless in development can still become production failures, and what teams should think through before touching live tables.
Introduction
Few things feel as unfair as a migration that looked totally harmless right up until it met a real table.
Locally, it ran in under a second. In staging, it barely registered. In production, it grabbed a lock at the wrong moment, backed up requests, and turned a clean deploy into a call nobody wanted to join.
That pattern is incredibly common because most developers learn migrations in environments too small and too calm to teach the dangerous parts.
This article is for developers shipping schema changes on systems that already matter. If you have ever said “it ran fine in dev” about a migration, you are exactly who I am talking to.
The Core Judgment: Migrations Are Operational Changes, Not Just Code Changes
Teams often treat migrations like ordinary implementation details because they live in the repo next to application code.
That is a category mistake.
A schema migration is a live operational event touching:
- lock behavior
- query performance
- rollout order
- reversibility
- data integrity
That means the right question is not only “does this migration work?” It is:
- what does it do to a large table?
- how long can it hold a lock?
- what happens if the app deploy and the migration get out of sync?
- can we recover safely if it fails halfway through?
Development rarely answers those questions honestly.
How This Breaks in the Real World
Most production migration pain comes from one of a few patterns:
- adding or changing columns in ways that force a full table rewrite (many column type changes do)
- backfilling data in the same deployment window
- adding indexes at the wrong time or with the wrong strategy
- releasing app code that assumes the schema is already fully changed
All of these can look safe on small datasets.
That is why migration risk gets underestimated. The code is short. The file looks simple. But the database is doing far more work than the diff suggests.
A Real Example: The “Simple” Constraint That Froze Writes
I saw a team add what looked like a reasonable constraint during a normal weekday deploy. The local migration was fast. The staging migration was fast. Nobody felt nervous.
Production had millions of rows and constant write traffic.
The change held a lock longer than expected. Writes began queueing behind it and request latency spiked. By the time people understood what was happening, rollback was no longer clean, because application instances were already booting with code that assumed the new schema.
The problem was not that the team used migrations.
The problem was that they treated a database change like a harmless code refactor instead of an operational event with timing and scale consequences.
What I Would Do Instead
When I touch important tables, I want the migration plan to be boring and staged:
- prefer additive changes first
- deploy code that tolerates both old and new shapes
- backfill separately when possible
- use online or concurrent strategies where the database supports them
- schedule riskier changes when the blast radius is easier to absorb
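Here is a minimal sketch of the first three points, assuming a hypothetical users table that is gaining a display_name column copied from username. It uses Python's built-in sqlite3 module only so the example runs anywhere. SQLite's locking is far simpler than a production database server's, so read it as an illustration of the ordering, not of lock behavior. The function names expand, read_display_name, and backfill are made up for illustration.

```python
import sqlite3

# Hypothetical schema: users(id INTEGER PRIMARY KEY, username TEXT NOT NULL),
# gaining a display_name column that starts out as a copy of username.


def expand(conn: sqlite3.Connection) -> None:
    """Additive step: add a nullable column; existing rows simply read NULL."""
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
    conn.commit()


def read_display_name(row: sqlite3.Row) -> str:
    """Tolerant read path: works before the column exists, while it is
    still NULL, and after the backfill has filled it in."""
    if "display_name" in row.keys() and row["display_name"] is not None:
        return row["display_name"]
    return row["username"]


def backfill(conn: sqlite3.Connection, batch_size: int = 1000) -> int:
    """Separate backfill: walk the table in primary-key order and commit
    each batch, so no single transaction touches every row."""
    last_id = 0
    updated = 0
    while True:
        ids = [r[0] for r in conn.execute(
            "SELECT id FROM users WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        )]
        if not ids:
            return updated
        cur = conn.execute(
            "UPDATE users SET display_name = username "
            "WHERE id BETWEEN ? AND ? AND display_name IS NULL",
            (ids[0], ids[-1]),
        )
        conn.commit()  # one short transaction per batch
        updated += cur.rowcount
        last_id = ids[-1]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row  # read_display_name expects sqlite3.Row
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT NOT NULL)")
    conn.executemany("INSERT INTO users (username) VALUES (?)",
                     [(f"user{i}",) for i in range(2500)])
    conn.commit()

    expand(conn)  # additive schema change; old code that ignores the column is unaffected
    row = conn.execute("SELECT * FROM users WHERE id = 1").fetchone()
    print(read_display_name(row))  # falls back to username while display_name is NULL
    print(backfill(conn, batch_size=500))  # run as its own job, not inside the deploy
```

Batching by primary key keeps each transaction short, so a failure partway through leaves the committed batches in place. The display_name IS NULL filter means the job can simply be rerun and will skip rows that are already done.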
The goal is not elegance in one deploy. The goal is survivability.
I would much rather do three smaller, safer steps than one beautiful migration file that turns into a production story later.
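As a sketch of what those smaller steps might look like, here is a hypothetical plan, assuming PostgreSQL, for adding a currency column to an orders table, indexing it, and then making it required. The table and column names, the Step class, and the staged_plan helper are invented for illustration. The SQL forms themselves (SET lock_timeout, CREATE INDEX CONCURRENTLY, ADD CONSTRAINT ... NOT VALID, VALIDATE CONSTRAINT) are standard PostgreSQL, but check the locking notes for your version before relying on them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One migration that ships, and can fail, on its own."""
    name: str
    statements: tuple[str, ...]
    # CREATE INDEX CONCURRENTLY cannot run inside a transaction block.
    in_transaction: bool = True


def staged_plan(table: str, column: str, lock_timeout: str = "5s") -> list[Step]:
    """Hypothetical PostgreSQL plan: add a column, index it, then require it.

    table and column are assumed to be trusted, hard-coded identifiers;
    they are interpolated directly into the SQL text.
    """
    # Fail fast if the lock is not granted in time, instead of waiting
    # and blocking the queries that line up behind this statement.
    guard = f"SET lock_timeout = '{lock_timeout}'"
    index = f"{table}_{column}_idx"
    check = f"{table}_{column}_not_null"
    return [
        # Nullable, no default: a catalog change, not a table rewrite,
        # though it still needs a brief exclusive lock.
        Step("add column", (guard, f"ALTER TABLE {table} ADD COLUMN {column} text")),
        # Builds the index without blocking writes; slower, and it must wait
        # for existing transactions that could use the index.
        Step("index",
             (f"CREATE INDEX CONCURRENTLY {index} ON {table} ({column})",),
             in_transaction=False),
        # NOT VALID: enforced for new writes, but skips scanning existing rows
        # while the heavier lock is held.
        Step("add constraint",
             (guard,
              f"ALTER TABLE {table} ADD CONSTRAINT {check} "
              f"CHECK ({column} IS NOT NULL) NOT VALID")),
        # Separate transaction: scans existing rows under a lock that still
        # allows reads and writes.
        Step("validate", (f"ALTER TABLE {table} VALIDATE CONSTRAINT {check}",)),
    ]


if __name__ == "__main__":
    for step in staged_plan("orders", "currency"):
        print(step.name, step.in_transaction)
        for sql in step.statements:
            print("   ", sql)
```

Each Step would ship as its own migration. Between "add column" and "add constraint", the application has to start writing the column and the backfill has to finish, because a NOT VALID check is still enforced on every new or updated row. If a statement hits the lock timeout, the step fails quickly and can be retried, which is usually cheaper than a queue of blocked writes.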
Closing
The migration that works in development can still be exactly the one that hurts you in production.
Not because the code was wrong, but because the environment finally told the truth about size, locks, and live traffic.
Schema changes deserve the same kind of operational respect people reserve for infrastructure work.
Because once the table matters, the migration is no longer just part of the codebase.
It is part of the release risk.