SQL Join Types Explained in Visuals

[+] twic|4 years ago|reply

This seems like a terrible explanation.

Firstly, joins are substantially orthogonal to keys. You can join two relations with no keys at all. Keys might help you make sure that there are going to be results to your join, but they are not required, and i wouldn't start with them.

Secondly, union is not a join. If mentioned at all, it should be in an addendum about other ways to combine tables.

Thirdly, left and right joins are specific kinds of outer join. Surely it's much clearer to describe "left, right, and full outer joins" than "left, right, and outer joins".

Fourthly, surely the most logical way to explain joins is the progressive refinement cross -> inner -> equi -> natural -> full outer -> left/right outer. A cross join is a Cartesian product, simple to explain. Restrict that with a predicate and you have an inner join (i'd use an example where the predicate is not equality). Introduce ON as a synonym for WHERE in inner joins. Make the predicate equality on a set of the columns and it's an equijoin. Introduce USING as a shorthand for ON in equijoins. Make that set all the corresponding columns and it's a natural join. Now include all the rows that didn't match and it's a full outer join. Note that WHERE wouldn't work here. Now just the rows on one side or the other and it's a left/right outer join.

[+] gopalv|4 years ago|reply

> A cross join is a Cartesian product

I think that's the most important detail that can't be explained visually like this.

Cartesians explode the row-counts - 10 rows x 10 rows is 100, not 10 or less than (which is the assumption for most people as they write joins with a primary key on at least one side).

If you have a cartesian and add on the additional entry needed for the null in outers, you can explain the output of the join as a filter over that set (full outers make more sense in that model).

In fact, that's a good way to think of when you see WHERE used to produce an inner join (cartesian + filter == inner).

Sometime ago, I had to optimize pure cartesian joins which had a user-defined function like ON overlaps(time_range1, time_range2) and this involved a deep-dive[1] into distributing a cartesian join to parallelize the filtering.

[1] - http://people.apache.org/~gopalv/xldb18/tez-xprod.pdf

[+] paulddraper|4 years ago|reply

1. They are also missing antijoins. There's not a dedicated SQL keyword for it, but I think it's worth mentioning (especially if you're going to talk about union/intersect/except).

2. Actually -- and I suppose I can't blame the article -- but from title I was expecting to see HashJoin, MergeJoin, NestedLoopJoin, etc. I.e. join implementations not specifications.

[+] radiospiel|4 years ago|reply

I really don't get the confusion around JOINs...

Any JOIN is what the author is calling a "CROSS JOIN", with "ON" (which is optional) basically being a WHERE condition, and "LEFT", "RIGHT", "INNER", "OUTER" words only describing whether or not to drop a result if the left/right side of the join would be a NULL reference.

And, of course, a "UNION JOIN" is not a join at all, because nothing is joined.

[+] darksaints|4 years ago|reply

There is a semantic difference between a `where` clause and an `on` condition. If you're intending to do a left|right|full outer join, filtering via the `on` condition versus the `where` clause can produce drastically different results, even if the predicates used are exactly the same.

[+] da_chicken|4 years ago|reply

> I really don't get the confusion around JOINs

Nearly all programmers get locked into thinking in terms of sequential, imperative programming, and the big differences are procedural vs OOP. They think that's always what programming looks like. They never learn to think in any other terms. They think functional programming is a weird outlier and looks like Lisp.

It's not entirely their fault. This is all they're typically taught in school. They're taught they don't need to think any other way than how Java, C, or C# tells them they should.

So they're not used to thinking in terms of sets. They're not used to declarative programming. Indeed, they'll often argue that anything other than general purpose, imperative programming isn't "real" programming. That's why Professor Brailsford from Computerphile got so much flack for saying that HTML was a programming language.

[+] magicalhippo|4 years ago|reply

I taught myself SQL on a need-to-know basis. One thing that blew my mind and really helped making certain non-trivial queries more manageable was when I realized I could join subqueries. It never occurred to me to try, until one day I did and it worked.

A contrived example for illustration:

    select oh.oid, oh.order_ref, ol.num_lines, ol.total_value
    from order_head oh
    join (
      select oid, count() as num_lines, sum(value) as total_value 
      from order_lines
      where status = 'ready'
      group by oid
    ) ol on ol.oid = oh.oid
    where total_value > 500

For certain queries this has led to huge speed-ups, by allowing rows to be filtered sooner rather than post-massive-join. For others it makes it easier to integrate with libraries, as aggregate fields like in the example can be filtered using the main where clause rather than in a having clause.

[+] pak9rabid|4 years ago|reply

Yep, I've used this with huge success in PostgreSQL when needing to join the contents of tables containing potentially millions of records that could (usually) then be filtered via search criteria. Basically I'd do my search-filtering first via a sub-query, then simply join the result of that vs doing it in the opposite order. The performance gains were orders of magnitude.

[+] Diggsey|4 years ago|reply

It's a nice visualization, but it assumes the keys on both sides are unique. Usually the key on at least one side of a join is not unique, and sometimes it's not unique on either side. The behaviour in these cases is an important part of the behaviour of the overall join.

[+] da_chicken|4 years ago|reply

I understand what you're driving at and I think you know this, but your terminology sounds confused to me. I don't think you're wrong, I just wouldn't phrase things the way you have.

At least in RDBMS terms it's not a key if it's not unique. "Key" means "a set of columns whose values uniquely identify a record in the table".

You might have a column which is a key in another table, a.k.a., a foreign key. But a foreign key is [nearly always] not itself a key in the table it's foreign in. You might also have a composite key with a subset of fields that are foreign keys, too. But I wouldn't describe a foreign key as a key. It's unique in another table but not unique to this one. It's a key there, but not a key here.

I would say that in nearly all cases in a well-design system, a proper JOIN condition is going to involve the key of either the left or right table. It's almost always incorrect not to have at least one JOIN condition involve a key. You'll usually know when not to do that because you know you're intentionally ignoring the data model.

I do agree that the visualization doesn't adequately communicate the row "duplication" that often happens and which new SQL users often find confusing or undesirable.

[+] contravariant|4 years ago|reply

It's a bit dangerous to visualize Union like that, people might confuse it with concatenation which it's not.

You'll want to use "UNION ALL" if your intent is to concatenate the two sets of rows, and not to calculate the (deduplicated) union. If only because it's much faster if your database doesn't have to check for duplicates.

[+] ProAm|4 years ago|reply

I've always liked Jeff Atwoods visual explanation of joins [1]

[1] https://blog.codinghorror.com/a-visual-explanation-of-sql-jo...

[+] s3r3nity|4 years ago|reply

That's a great article - however my issue with Atwood's explanation is in the classic use of overlapping circles to describe joins.

After ~2 decades of using SQL in some capacity, the circles metaphor still makes no fucking sense to me.

Dataschool's article takes a somewhat non-orthodox approach with coloring rows / columns that would be included in the results of the joins - which I appreciate a lot more.

[+] major505|4 years ago|reply

https://i.pinimg.com/originals/0e/ae/a1/0eaea116051e07f24314...

Still the best tutorial.

[+] hactually|4 years ago|reply

So much better than the usual Venn diagram mess.

Props to the author

[+] flowerlad|4 years ago|reply

Not discussed in this article: 1-many, 1-1 and many-many relationships. Also: Fan traps and chasm traps. These issues are important to understand, to avoid serious mistakes.

[+] Serow225|4 years ago|reply

thanks, I hadn't heard of the traps before. Cheers

[+] 725686|4 years ago|reply

Just google "sql joins diagram" and you will be rewarded by a quick one image visual representation of each type of join. Here is one in stackoverflow: https://stackoverflow.com/questions/13997365/sql-joins-as-ve...

[+] statmapt|4 years ago|reply

That is a smooth site. I like how you can suggest edits to their articles right in google docs.

What framework/service is being used to build it?

[+] b0tch7|4 years ago|reply

Off Topic slightly and perhaps deserves a "Ask HN:" post - but I'd love any recommendations for best resources to quickly get to "medium proficiency" with SQL. Thanks!

[+] AtillaBosma|4 years ago|reply

Good, clear explanation of these SQL joins. Works well as a quick way to mentally visualize what each join would do in your own implementation.

Thanks for the learning resource.

[+] ohashi|4 years ago|reply

Nice visualization. Never heard/used cross join. What would be use case for something like that in practice?

[+] Xcelerate|4 years ago|reply

I mostly use CROSS JOIN when I need "one of every combination".

Suppose we have a log of events, and each event has a user, a date, and an event action associated with it. Now I want to transform this log into a tabulation, where I have one record for each combination of user, date, and action type with an associated count. I can do something like:

    WITH tabulation AS (
        SELECT
            user,
            action,
            date,
            COUNT(*) AS num_events
        FROM events
        GROUP BY
            user,
            action,
            date
    ),
    combinations AS (
        SELECT
            users.user,
            actions.action,
            dates.date
        FROM (SELECT DISTINCT user FROM tabulation) users
        CROSS JOIN (SELECT DISTINCT action FROM tabulation) actions
        CROSS JOIN UNNEST(
            SEQUENCE(
                (SELECT MIN(date) FROM tabulation),
                (SELECT MAX(date) FROM tabulation)
            )
        ) AS dates(date)
    )
    SELECT
        user,
        action,
        date,
        COALESCE(num_events, 0) AS num_events
    FROM combinations
    LEFT JOIN tabulation USING (user, action, date)

[+] da_chicken|4 years ago|reply

They're fairly rare, but they do come up. It allows you to very quickly create data, and since it's multiplication it can very quickly create a lot of data. It's mostly a way of thinking about things rather than something you actually do.

Usually they can be written as an INNER JOIN because what you really want is a non-standard join or a partial CROSS JOIN. You want a Cartesian product. For example, maybe you want to join an employee table to a calendar table to produce a list of dates for each employee. Usually you want a subset of the calendar though, so you end up with something like:

  SELECT *
  FROM Employee e 
  INNER JOIN Calendar c 
      ON c.Cal_Date BETWEEN '2021-01-01' AND '2021-12-31'

They also come up when you define an inline table with a VALUES statement, which is like creating a virtual table in a query. So if you want one row for M-F, you could do something like this:

  SELECT *
  FROM Employee e
  CROSS JOIN (VALUES ('M'),('T'),('W'),('R'),('F')) d (Day_of_week)

Note that we have to give the system an alias for the "table" we're making and a name for the column we created, too.

Another common use is with a tally table, which is a utility table that stereotypically has a single column with integers from 1 to some arbitrary value (10,000 is common). It's got many uses. Say you want to look for non-English letters in names. You have a UNICODE() function that returns the character code of a single character, and a SUBSTRING() function that returns a substring of a character field.

  SELECT *
  FROM Employee e
  INNER JOIN Tally t
      ON t.N BETWEEN 1 AND LEN(e.Name)
  WHERE UNICODE(SUBSTRING(e.Name,t.N,1)) NOT BETWEEN 32 AND 126

Note again that it's not actually written as a CROSS JOIN here, but it's the same line of logic that leads to CROSS JOINs. We're creating a Cartesian product. You could write it as a CROSS JOIN, but this method makes it a bit more readable to understand the logic.

(The above is written with SQL Server or T-SQL. Other RDBMSs have different syntax, but can generally do the same things.)

[+] denton-scratch|4 years ago|reply

Cross Join is also called Cartesian Join.

A (vague, handwavy) use case would be if you had a table of Towns, and a table of Pizza_Chains; by iterating over the joined table, you would see every permutation of Towns and Pizza_Chains, and (by inputting address info) you could produce a new table giving the address for each Pizza_Chain in each Town.

The cross-join would be used once, to create the new table. You want to use cross joins as little as possible, so a single-use cross join is reasonable. I admit it's a pretty stretched example; I suspect a better example might be found in a more abstract field, like maths, but I'm not feeling abstract enough right now to work one out.

You make a cross join like this:

SELECT Towns.id, Towns.name, Pizza_Chains.id, Pizza_Chains.name FROM Towns, Pizza_Chains

I.e. you simply don't have a WHERE clause.

I don't think I've ever had occasion to use a cross join. It's not a very useful query, because the resultset doesn't give you any information - or rather, it tells you every possible result you could get, if you tried the query with every possible WHERE clause.

[+] dgb23|4 years ago|reply

When you want to view (or display) all possible combinations.

It doesn't make much sense for example with relations that describe things that happened or are (facts). For example you wouldn't cross join a users and addresses table.

But when relations are composable, say a product with selectable features, then it can make sense to cross join.

An example:

You ask a user a few questions and want to suggest a solution or product based on the combinations of the answers. The suggestions are provided by domain experts in advance.

In this case you might want to use a cross join to provide a matrix of all combinations to your domain experts. They can fill in valid suggestions for some and invalid ones for others.

Based on their input you can now automatically generate a questionnaire.

[+] systems|4 years ago|reply

cross joins create data (facts) that dont exist in real life

you will probably never use them, except in rare cases for example

return employee sales by products, show 0 if no sales in a normal db, you will only have records for products with sales above 0

you will need a cross join to join every employee with all products even those with no sales

and full outer join will not work, because it will only return products with some sales for at least one employee

only a cross join will return all products

i will do this in two steps 1 cross join emp and product left join the result from 1 to product sales replace nulls with 0

rare case, stupid case too

[+] pdubs|4 years ago|reply

I've only used them while using certain frameworks to dynamically generate queries referencing multiple tables. They end up getting turned into an implicit inner join though. The following queries are equivalent.

  SELECT a.*, b.*
  FROM a CROSS JOIN b
  WHERE a.id = b.a_id
  
  SELECT a.*, b.*
  FROM a, b
  WHERE a.id = b.a_id

  SELECT a.*, b.*
  FROM a INNER JOIN b ON a.id = b.a_id

IIRC .NET's Entity Framework will (or at least used to) generate queries similar to this as well.

[+] chiyc|4 years ago|reply

I've only used it a few times, but it was usually to combine data from two tables that didn't have a column I could join them on.

[+] twic|4 years ago|reply

Enumerating a cafe menu:

  select *
  from
    (values ('egg'), ('sausage'), ('spam')) main,
    (values ('baked beans'), ('chips'), ('spam')) side;

Of course, to get the complete menu, you would need to use a recursive common table expression.

[+] friendzis|4 years ago|reply

In Oracle in some cases cross join works as cross apply, so you can cross join table A with table-returning structure and have it compute column based on table A.

[+] jackfoxy|4 years ago|reply

I like this one page visual for explaining joins much more succinctly https://www.codeproject.com/KB/database/Visual_SQL_Joins/Vis...

[+] mutagen|4 years ago|reply

Venn diagrams accurately describe set operations but most joins in common use in SQL are filtered Cartesian products.

https://blog.jooq.org/2016/07/05/say-no-to-venn-diagrams-whe...

[+] killingtime74|4 years ago|reply

Missed self join

53 comments