top | item 36904796

(no title)

wfriesen | 2 years ago

I've always used this. Any rows returned means there are differences

  select * from (
    (
      select *
      from table1
      minus
      select *
      from table2
    )
    union all
    (
      select *
      from table2
      minus
      select *
      from table1
    )
  )

discuss

ryzvonusef|2 years ago

I don't think most SQL flavours support MINUS function, imho.

Bing Chat says: > The MINUS operator is not supported in all SQL databases. It can be used in databases like MySQL and Oracle. For databases like SQL Server, PostgreSQL, and SQLite, use the EXCEPT operator to perform this type of query

OskarS|2 years ago

Does this work with the bag/multiset distinction that the author uses? Like, if table1 has two copies of some row and table2 has a single copy of that row, wont this query return that they're the same? But they're not: table1 has two copies of the same row, whereas table2 just has one?

LeonB|2 years ago

I found that a weird edge case for the original author to fixate on. In mathematics or academia sure, but in “real” sql tables, that serve any kind of purpose, duplicate rows are not something you need to support, let alone go to twice the engineering effort to support. Duplicates are more likely to be something you deliberately eradicate (before the comparison) than preserve and respect.

gregw2|2 years ago

I spent time researching this a while back.

The duplicate row issue is part of why I don't use MINUS for table value comparisons, nor RECURSIVE like the original article suggests (which is not supported in all databases and scarier for junior developers)... You can accomplish the same thing and handle that dupes scenario too, with just GROUP BY/UNION ALL/HAVING, using the following technique:

https://github.com/gregw2hn/handy_sql_queries/blob/main/sql_...

It will catch both if you have 1 row for a set of values in one table and 0 in another... or vice-versa... or 1 row for a set of values in one table and 2+ (dupes) in another.

I have compared every row + every column value of billion-row tables in under a minute on a columnar database with this technique.

Pseudocode summary explanation: Create (via group by) a rowcount for every single set of column values you give it from table A, create that same rowcount for every single set of column values from table B, then compare if those rowcounts match for all rows, and lets you know if they don't (sorted to make it easier to read when you do have differences). A nice fast set-based operation.

globular-toast|2 years ago

This is symmetric difference, but still has the problem that it's a set operation whereas in general a table is a bag (multiset).

DougBTX|2 years ago

In theory, yes, however the vast majority of tables will have some form of unique ID in each record... so in practice, there’s usually no difference. But if it must work for all tables...

EddieJLSH|2 years ago

Realistically which production DB tables don't have a unique id? Genuine question, never used one in my life.

unknown|2 years ago

[deleted]