How not to change PostgreSQL column type

clarkbw|1 year ago

My tests running ALTER varied from ~20 seconds to ~1 min for the changes.

> Current CI/CD practices often make it very easy for software developers to commit and roll out database migrations to a production environment, only to find themselves in the middle of a production incident minutes later. While a staging deployment might help, it's not guaranteed to share the same characteristics as production (either due to the level of load or monetary constraints).

(neon.tech employee here)

This is where branching databases with production data helps quite a bit. Your CI/CD environment and even staging can experience the schema changes. When you build from a seed database you can often miss this kind of issue because it lacks the characteristics of your production environment.

But the author rightly calls out how staging isn't even enough in the next paragraph:

>The problem is, therefore (and I will repeat myself), the scale of the amount of data being modified, overall congestion of the system, I/O capacity, and the target table's importance in the application design.

Your staging, even when branched from production, won't have the same load patterns as your production database. And that load and locks associated will result in a different rollout.

This has me thinking if you can match the production environment patterns in staging by setting staging up to mirror the query patterns of production. Mirroring like what's available from pg_cat could put your staging under similar pressure.

And then this also made me think about how we're not capturing the timing of these schema changes. Unless a developer looks and sees that their schema change took 56 seconds to complete in their CI system you won't know that this change might have larger knock on effects in production.

radimm|1 year ago

Author here - this is my primary goal, exposing the complexity developer might not even think about. Can't even count number of instances seemingly inconspicuous changes caused incident.

"Works on my DB" is new "works on my machine" (and don't trademark it, please :)))

jihadjihad|1 year ago

Good post, you can tell the author has some burn marks from seeing some sparks in production.

> Let's start with The Bad News - you cannot avoid rewriting the entire table, which will generate a significant amount of WAL files in the process. This is a given, and you must plan how to manage it.

There are some exceptions to this. The blog post focuses on changing a column from int to bigint, which indeed needs a full table rewrite to reallocate 8 bytes per ID value instead of 4. But:

> As an exception, when changing the type of an existing column, if the USING clause does not change the column contents and the old type is either binary coercible to the new type or an unconstrained domain over the new type, a table rewrite is not needed. [0]

So, you could change from varchar to text, for example, or change from varchar(20) to varchar(4), etc. and not incur a full table rewrite.

0: https://www.postgresql.org/docs/current/sql-altertable.html

jbub|1 year ago

"Ensure the column is NULLable and does not have a default value to avoid forcing a full table rewrite."

That one should be fixed since Postgres 11 https://brandur.org/postgres-default

radimm|1 year ago

Thank you for correction! Indeed - it's one of those things i've read but practically avoided it for historic reasons and therefore forgot about it.

junto|1 year ago

Microsoft SQL Server has a similar issue but also similarly when adding a new column to a table, which I found out by accident when trying to deploy a migration that had worked fine in our non-production environment stages.

Adding the column as not nullable and setting a default is the absolute killer.

A work around when you still want to have a default is to do a three pass migration. First add the column as nullable and no default, then run an update to set the default on all rows, and finally change it to not nullable and set the default constraint.

It has a surprising difference in speed.

j-cheong|1 year ago

Changing a PostgreSQL column type without following the author's instructions and just running the following command is VERY anti-pattern. Confused why people do this in the first place.

ALTER TABLE table_name ALTER COLUMN column_name [SET DATA] TYPE new_data_type

>you need to make sure the source system has enough disk space to hold the WAL files for a long enough time

if the asynchronous replication process has an external buffer instead of the WAL, then it addresses this issue

chatmasta|1 year ago

> Confused why people do this in the first place

Probably because every tutorial on the Internet, along with the docs, recommends doing it this way. All the gotchas are buried in the footnotes.

yen223|1 year ago

The unsafe ALTER COLUMN is one step.

The safe option is four steps minimum.

It's not hard to see why people would be tempted by the unsafe option.

mattashii|1 year ago

> Confused why people do this in the first place.

Because you lose a significant amount of performance if you start adding NULL and variable-length columns just because you're afraid of a table rewrite.

Because the resulting table will not have had 1 table of update-induced bloat at the end of the operation.

Because you can be sure the modification is applied atomically and you as the user can be sure the migration from A to B goes through as expected or has a graceful rollback to the old data, rather than getting stuck or failures halfway through the migration.

Because toasted data from DROP-ed columns is not removed from storage with the DROP COLUMN statement, but only after the row that refers to that toasted value is updated or deleted.

...

Every column you "DROP" remains in the catalogs to make sure old tuples' data can be read from disk. That's overhead you now will have to carry around until the table is dropped. I'm not someone who likes having to carry that bloat around.

dgan|1 year ago

Not everyone has billion rows in their tables ¯\_(ツ)_/¯

miohtama|1 year ago

For ” How to Safely Change a PostgreSQL Column Type” (new column, trigger, etc). Is there a particular reason PostgreSQL cannot set up column migration itself out-of-the-box? I have used PSQL for many years and it is always a bit cumbersome to do the column migrations manually, even though the process itself is quite mechanical.

mattashii|1 year ago

> Is there a particular reason PostgreSQL cannot set up column migration itself out-of-the-box?

People haven't asked hard enough to the right people, I suppose. PostgreSQL is an open-source project, where wanted and needed features are supplied by willing individuals and companies, and vetted by the community so that the code base remains in good quality.

I just suppose no-one has bothered yet with implementing ALTER TABLE CONCURRENTLY to the point that it's been accepted by the community, though another reason might be because the induced metadata churn (only 1600 distinct user-defined columns available in each table at most) might become problematic with ADD COLUMN ... DROP COLUMN.

paulryanrogers|1 year ago

So Postgresql and MySQL can both do full table copies as needed. And their locking has gotten better. They also can do more changes in place. Yet still too often they'll need exclusive locks which blocks even reads.

For very big tables it's often better to manually add a column, backfill then drop-swap.

Thaxll|1 year ago

This is pretty much how pt-online-schema-change from Percona works.

Good old MySQL days...

evanelias|1 year ago

My thoughts exactly. It's surprising that external online schema change tools for Postgres have only become a thing fairly recently! The only two I'm aware of are:

* pgroll: Written in Golang, first commits June 2023. https://github.com/xataio/pgroll

* pg-osc: Written in Ruby, first commits Dec 2021. https://github.com/shayonj/pg-osc

Meanwhile over in the MySQL and MariaDB ecosystem, external OSC tools have been around for quite some time, starting with oak-online-alter-table over 15 years ago. The most popular options today are pt-online-schema-change or gh-ost, but other battle-tested solutions include fb-osc, LHM, and the latest entry Spirit.

vrosas|1 year ago

I’m probably alone here and this goes against the HN consensus but, as great a piece of tech Postgres is, I’ve often found it to be a premature optimization to pick as the main DB for newer companies. If you don’t know what your data models are, you’re going to end up in worlds of pain constantly running schema migrations like the OP describes.

unclebucknasty|1 year ago

It's not Postgres, specifically, as much as any SQL or non-schemaless database, right?

And if we're saying that's a problem, then sounds like we're going back into the NoSQL debates from a decade ago.

Hopefully not.

I think it's better to understand your schema as much as possible, and have a sane process for applying changes when needed. Defining a schema forces you to think about what you're building.

OTOH, the idea that developers on a project are just going to throw whatever new attributes they need into a document as they go along is a recipe for bugs and inefficiency. Also, near-instant technical debt, as early work frequently survives longer than anticipated.

You also don't completely escape data changes without pain when using a NoSQL database. If for instance you change a string to an int you'd still need to figure out what to do with existing data, either via conversion or handling in-code.

dewey|1 year ago

I'd say Postgres is very often the right choice for newer company as it's well understood, easy to operate and you know you don't have to switch to a new DB because the vendor got acquired / merged / shutdown after 2 years or is going through growing pains and deprecations themselves.

If you give your schema a good thought (The one place where you shouldn't rush and take shortcuts at the beginning) and for example use jsonb columns and later move data out of it if you notice you need to query on it more performantly you can get very far.

The pain of data model migrations is also usually not that big if the company isn't very large and has a lot of legacy yet.

beeboobaa3|1 year ago

> premature optimization

I'm extremely curious to hear what you consider a better/simpler choice. At least postgres gives you the tools to do schema migrations, and if you're operating at a scale where such migrations become a problem (i.e. probably not for a while) you really ought to know what you're doing.

xboxnolifes|1 year ago

I'm confused by the use of premature optimization here. What exactly are you suggesting is the better default choice?

29 comments