Upsert in SQL | WingNews

[+] petergeoghegan|2 years ago|reply

> insert...on conflict is a fine way to do upserts on older PostgreSQL versions. For modern ones (15+) there is a better alternative [SQL Standard: MERGE]

This is incorrect. To quote the Postgres MERGE docs:

"When MERGE is run concurrently with other commands that modify the target table, the usual transaction isolation rules apply; see Section 13.2 for an explanation on the behavior at each isolation level. You may also wish to consider using INSERT ... ON CONFLICT as an alternative statement which offers the ability to run an UPDATE if a concurrent INSERT occurs. There are a variety of differences and restrictions between the two statement types and they are not interchangeable."

MERGE can give duplicate violation errors when concurrent inserts take place (at the default READ COMMITTED isolation level). MERGE is no more capable of avoiding these sorts of race conditions than totally naively application code. In short, upserting is precisely what you're not supposed to use MERGE for -- ON CONFLICT is expressly designed for that use case.

[+] kardianos|2 years ago|reply

I agree. In SQL Server, merge comes with a number of quirks that cause me more pain when I try to use it. The ONLY reason I use MERGE today, isn't to merge, but to include non-inserted columns in the OUTPUT clause; only the merge statement can do that. This can be useful when you want to output a lookup from a source to a newly inserted identity (auto-increment) column.

Generally, stay away from MERGE. At least in SQL Server.

[+] singingfish|2 years ago|reply

I've done a lot of work on MERGE queries recently. For postgres I suspect that self updating CTEs that report back what actually happened are maybe a better idea.

[+] chasil|2 years ago|reply

MERGE is standardized in ANSI SQL 2003, and updated in 2008:

https://en.wikipedia.org/wiki/Merge_(SQL)

No question, everyone should use standardized, cross-platform syntax if at all possible.

Maybe someday ANSI will standardize some kind of upsert syntax; until that time use upsert only if you are explicitly tying your SQL to your particular database.

[+] MarkusWinand|2 years ago|reply

I recently covered MERGE on modern-sql.com: https://modern-sql.com/caniuse/merge

There I also look at limitations of some implementations and problems such as not reporting ambiguous column names — just guessing what you mean ;)

[+] bafe|2 years ago|reply

Great work as always! I still remember looking at your site when I was working on an ETL system for sensor data and realising in frustration how MariaDB was missing all the nice modern SQL functions like merge.

[+] sebazzz|2 years ago|reply

I've noticed that MERGE in MSSQL can have a performance penalty as opposed to have separate insert and update statements.

[+] swasheck|2 years ago|reply

mssql merge has a lot of problems. it's good for the most basic and straightforward of use cases, but it struggles with a variety of features:

https://michaeljswart.com/2021/08/what-to-avoid-if-you-want-...

[+] NicoJuicy|2 years ago|reply

In ms sql you can do an update query, check the rowcount and if it's 0, then do an insert.

That's how I got an improved query performance for syncing data.

[+] gregw2|2 years ago|reply

The problem with plain/primitive upset/merge operations is they don’t always support handling of what should happen when a row is deleted from the source. People new to data engineering don’t always recognize this.

Im those cases, one needs “delete-then-upsert” or “soft-delete-then-upsert” in your upset/merge implementation. I was a tad disappointed to see this only peripherally alluded to.

With some flavor of merge implementations you can do this with a pair of merge calls but in my opinion, ideally it should be one merge operation and db vendors should wise up to this.

Per https://modern-sql.com/caniuse/merge only SQL Server and BigQuery currently support “when not matched by source”. But Snowflake lets you have multiple “when not matched” clauses with “and” expressions so perhaps you can do it on that. Redshift docs show examples of using two merge statements.

[+] hn_throwaway_99|2 years ago|reply

> The problem with plain/primitive upset/merge operations is they don’t always support handling of what should happen when a row is deleted from the source.

I didn't understand what you meant by this - you seem to be referencing a larger workflow (e.g. keeping 2 tables in sync), and I'm not understanding how is relevant to the "upsert" primitive. Can you give an example?

[+] bakugo|2 years ago|reply

Something I find somewhat annoying about MySQL's INSERT...ON DUPLICATE KEY UPDATE is that it increments auto-increment columns even if no new row is inserted. Does anyone know if Postgres' ON CONFLICT and MERGE do the same?

[+] sbuttgereit|2 years ago|reply

I expect this behavior to be pretty common. In PostgreSQL, auto incrementing keys like this are backed by sequences.

Sequences, at least in the database products I'm familiar with, do not exhibit transactional behavior; I'm not sure if this is actually part of the standard, but if not, I at least expect this to be pretty common. Part of the reason for this is to avoid introducing a point of serialization in concurrent transaction processing. If we were to do what you're asking for, you can get quite a performance hit. Consider a long running transaction involving an insert into a table using an auto-incrementing key. Any other transaction that wanted to get an ID would have to wait for that long running transaction to finish to get a number from the sequence... the waiting transaction wouldn't know if the long running transaction was going to commit or rollback and so the sequence number would be undefined until the waiting transaction could get its own lock.

If we set forth the expectation that a sequence will only guarantee you that you'll get a unique number from within the range, but that any other desire like gapless numbering is out of scope for the sequence you can avoid that whole waiting game. Sure, you might have gaps as a result; if the long running transaction finally rolls back, you'd get a gap for the number it consumed.... but no one waited to find out if that was the case and all the records got unique IDs.

If you need to ascribe meaning beyond merely "unique identity" to such a record ID, it might be that an auto-sequencing is simply the wrong tool. I think the compromise and expectation that sequences, and extended functionality depending on sequence mechanics, are not transactional is a good one given the performance costs of ensuring gapless numbering whether you really need it or not.

[+] andorov|2 years ago|reply

worth re-upping -

postgres requests an id from the sequence for each row of the incoming data ahead of time since it doesn't know which rows are updates and which are inserts (although presumably this could be changed?). the sequence doesn't reset down for the unused so this can eat through it unexpectedly quickly if you have a table with a relatively large volume of updates.

also as a tip if you hit the max integer for the sequence and need space to implement a fundamental fix you can quickly change the sequence to start at -1 and go down. there's no issue with negative ids since they're also integers.

[+] dewey|2 years ago|reply

PG upsert also increments the counter.

I somehow feel it's wasteful and feel a bit conflicted. I also realize that while it feels natural to somehow get a feeling for the row numbers and it can help with debugging because you somewhat know some IDs / ranges after a while it's probably more something I just got used to and maybe shouldn't rely on.

[+] tucnak|2 years ago|reply

Yes, unfortunately it does and I don't think you can work around this...

[+] sixbrx|2 years ago|reply

Good article, but I wouldn't characterize MERGE as strictly better than the older "ON CONFLICT ..." for Postgres, because MERGE doesn't support any sort of RETURNING clause.

[+] keosak|2 years ago|reply

INSERT ... ON CONFLICT has a problem with CHECK constraints. The constraint must be satisfied in the INSERT tuple even if the row already exists and UPDATE will be executed. I recently dealt with this at work and the resulting CTE query was very similar to MERGE, except MERGE has better syntax.

[+] Nezteb|2 years ago|reply

I found a post on the mailing list about this: https://www.postgresql.org/message-id/CANbhV-EXmw9i4YhJTgsrL...

I'm not sure how to go about making an official feature/change request though.

[+] majkinetor|2 years ago|reply

Yeah, we were happy to see MERGE in latest postgres but abandoned it after we found there is no RETURNING...

[+] polygotdomain|2 years ago|reply

The challenge with UPSERTs is ambiguity, and the variety of syntax options you see on the page indicates different attempts to try to address that ambiguity while still keeping a simple syntax.

Personally, I work with MSSQL which has skipped adoption of UPSERT syntax and supported MERGE statements for probably 15 years now. While there's a certain degree of complexity in a MERGE statement, there's also a strong level of explicitness that is helpful in making sure certain conditions don't fall through the cracks or don't perform an operation that isn't expected.

The ability to handle different conditions in match statements is incredibly powerful and allows potentially complex ETL statements to be handled within a single merge statement. I'm not saying that's always the best practice, but if things are structured in the right way, it may be.

I believe the transaction isolation of a MERGE statement (in MSSQL at least) means that the lock on the rows is not released between the read and the write, which means you'll have fewer issues with reading data, attempting an update, then having it fail bc another process made a change in between the two calls.

I regularly use output tables to debug the results of a MERGE statement, and in this case the $action keyword is incredibly helpful for knowing what was done.

[+] tabtab|2 years ago|reply

> allows potentially complex ETL statements to be handled within a single merge statement. I'm not saying that's always the best practice, but if things are structured in the right way, it may be.

I wonder how often one has the ideal schemas for such. The advantage of an ETL is that it's probably easier to debug, log, and trouble-shoot problems. One-Giant-Do-All-Command is often a recipe for headaches. Being able to "X-Ray" the intermediate pipes really helps troubleshooting.

The X-ray issue is one reason I have trouble debugging functional programming. Maybe there's a way to X-ray better, but I haven't found it yet.

[+] TheCycoONE|2 years ago|reply

It's been many years since I worked at a MSSQL shop but at least at the time (2008R2) there were issues with MERGE upsert statements deadlocking with themselves when run concurrently.

Based on https://dba.stackexchange.com/questions/187776/merge-deadloc... it seems that's still an issue as of 5 years ago, and not the recommended way of doing an upsert on MSSQL.

[+] compumike|2 years ago|reply

Is anyone aware of performance benchmarks comparing Postgres's new MERGE to the old INSERT ... ON CONFLICT?

The "WHEN NOT MATCHED THEN DO NOTHING" is very appealing, and we currently implement this with a transaction and a DELETE after the upsert because we want to do updates only, not unintentionally insert new rows.

[+] hn_throwaway_99|2 years ago|reply

> we currently implement this with a transaction and a DELETE after the upsert because we want to do updates only, not unintentionally insert new rows.

Huh??? Why not just do an UPDATE then? After all, you seem to be saying "we want to do an upsert but without the insert part".

Edit: I'm also trying to understand the value of "WHEN NOT MATCHED THEN DO NOTHING". How is that any different that just leaving off the "WHEN NOT MATCHED" clause altogether?

[+] gregw2|2 years ago|reply

FYI, article doesn’t say it but Redshift nowdays supports MERGE: https://docs.aws.amazon.com/redshift/latest/dg/r_MERGE.html

[+] victor106|2 years ago|reply

Great explanation,

I highly recommend the author’s book

https://antonz.org/sql-window-functions-book/

[+] paulddraper|2 years ago|reply

Excellent article.

I did not realize that PostgreSQL now supported MERGE

[+] LorenPechtel|2 years ago|reply

That will save me a few headaches--I wasn't aware of the MYSQL ability to reference the values in the update part.

[+] ZiggerZZ|2 years ago|reply

An old Python package with the implementation of MERGE in BigQuery - https://github.com/ZiggerZZ/pygbq

[+] yellow_lead|2 years ago|reply

Its annoying that a standard upset cannot be part of SQL.

[+] hn_throwaway_99|2 years ago|reply

But MERGE is now the upsert that is part of the SQL standard.

[+] gomezjdaniel|2 years ago|reply

Thanks for the article.

You have ton of interesting posts in your personal website!

[+] mikece|2 years ago|reply

I don't know how recently the "UPSERT" nomenclature came into use but I laughed the first time I heard the term used several years ago, thinking it was a verbal contraction for a concept that, while logical, didn't actually exist in database engines (my ignorance was quickly remedied after my expression of mirth). It certainly wasn't part of the SQL courses back in 1998.

55 comments