Joins 13 Ways | WingNews

[+] bob1029|2 years ago|reply

Once I started thinking about joins in terms of spatial dimensions, things got a lot easier to reason with. I like to think of the inner join like the scene from Stargate where they were describing how gate addressing works. Assume you have a contrived example with 3 tables describing the position of an off world probe in each dimension - X, Y and Z. Each table looks like:

  CREATE TABLE Dim_X (
    int EntityId
    float Value
  )

Establishing a point in space for a given entity (regardless of how you derive that ID) is then a matter of:

  SELECT Dim_X.Value AS X, Dim_Y.Value as Y, Dim_Z.Value as Z
  FROM Dim_X, Dim_Y, Dim_Z
  WHERE Dim_X.EntityId = Dim_Y.EntityId --Join 1
  AND Dim_Y.EntityId = Dim_Z.EntityId --Join 2
  AND Dim_X.EntityId = @MyEntityId --The entity we want to find the 3D location of

You will note that there are 2 inner joins used in this example. That is the bare minimum needed to construct a 3 dimensional space. Think about taking 3 cards and taping them together with 2 pieces of rigid metal tape. You can make a good corner of a cube, even if it's a bit floppy on one edge. Gravity doesn't apply inside the RDBMS, so this works out.

This same reasoning can be extrapolated to higher, non-spatial dimensions. Think about adding time into that mix. In order to find an entity, you also now need to perform an inner join on that dimension and constrain on a specific time value. If you join but fail to constrain on a specific time, then you get a happy, comprehensive report of all the places an entity was over time.

The other join types are really minor variations on these themes once you have a deep conceptual grasp (i.e. can "rotate" the schema in your mind).

Playing around with some toy examples can do wonders for understanding. I sometimes find myself going back to the cartesian coordinate example when stuck trying to weld together 10+ dimensions in a real-world business situation.

[+] quasarj|2 years ago|reply

I now understand joins less than I did before.

[+] michaelmior|2 years ago|reply

This reminds me a lot of HyperDex[0], which hashes values into a multidimensional hyperspace based on their attributes for indexing purposes.

[0] https://dbdb.io/db/hyperdex

[+] rjbwork|2 years ago|reply

This is like, a hypernormalization of the data though, isn't it? I think in standard BCNF you'd just leave your table as

  CREATE TABLE EntityPosition (
    int EntityId,
    float X,
    float y,
    float z
  )

It does remind me of data warehouse stuff though, given we're working with aggregates and piecing together bits of various dimensions.

[+] spopejoy|2 years ago|reply

Your example raises a question I've never found a good answer to:

Why use `JOIN` (and `INNER JOIN` and friends) at all when you can write the exact join more accurately IMO with WHERE clauses (and using "*=" operators), and then just list all the tables in the FROM clause?

My brain always hurts trying to parse complex FROM clauses with various JOINs whereas reading the equivalent equations in WHERE clauses is way more straightforward to me.

[+] iamcreasy|2 years ago|reply

I think you are saying that every join is a variation of cross join. Did I understand you correctly?

[+] farkanoid|2 years ago|reply

Great. Now you've made me want to re-watch Stargate!

[+] roywiggins|2 years ago|reply

The 0th would be "it's an operator in relational algebra."

https://en.m.wikipedia.org/wiki/Relational_algebra

("The result of the natural join [R ⋈ S] is the set of all combinations of tuples in R and S that are equal on their common attribute names... it is the relational counterpart of the logical AND operator.")

⋈ amounts to a Cartesian product with a predicate that throws out the rows that shouldn't be in the result. Lots of SQL makes sense if you think of joins this way.

[+] jimwhite42|2 years ago|reply

In the functional interpretation of relational theory, a join is function composition, surprised that one was left out.

[+] gpderetta|2 years ago|reply

The cross-product plus predicate is referenced in "A join is a nested loop over rows".

[+] mirekrusin|2 years ago|reply

For several days I'm having trouble finding good resources on _implementation_ for query execution/planning (predicates, existing indices <<especially composite ones - how to locate them during planning etc>>, joins etc).

Google is spammed with _usage_.

Anybody has some recommendations at hand?

ps. the only one I found was CMU's Database Group resources, which are great

[+] gavinray|2 years ago|reply

There's an entire 700-page book about this you can access for free here:

"Building Query Compilers"

https://pi3.informatik.uni-mannheim.de/~moer/querycompiler.p...

EDIT: You also might get use out of TUM's course "Database Systems on Modern CPU Architectures"

The year that contains all the video lectures is 2020:

https://db.in.tum.de/teaching/ss20/moderndbs/?lang=en

[+] sobellian|2 years ago|reply

I don't know if this has the level of detail you're looking for, but you might want to check https://www.sqlite.org/optoverview.html and https://www.sqlite.org/queryplanner.html.

[+] aidos|2 years ago|reply

My go to pointer for this is to read the Postgres docs (and / or source - which is also super readable).

https://www.postgresql.org/docs/current/planner-optimizer.ht...

[+] Sesse__|2 years ago|reply

Generally, this is a specialist topic, so you're unlikely to find e.g. good textbooks. It depends a bit on how deep you want to go and what specific parts you're interested in, though. (Query execution is pretty much entirely separate from planning, for one.)

The best join optimizer paper I know of is still the original Selinger paper (https://courses.cs.duke.edu/compsci516/cps216/spring03/paper...). It doesn't support outer joins and there are more efficient techniques by now, but anyone looking at a System R-like optimizer can read this and feel right at home. (There is also the Volcano/Cascades school of optimizers, which is rather different from System R, but I've found even less information on it.)

As others have said, Postgres' source and documentation is good. In particular, src/backend/optimizer/README contains a number of interesting things that you won't find a lot of other places. (The Postgres optimizer doesn't always use the latest fancy designs, but it's generally very polished and pretty easy to read.)

I can also echo Andy Pavlo's courses (at CMU), they're pretty much the only ones who will explain this stuff to you online. The “Building Query Compilers” PDF is rather incomplete and there's a lot of stuff I never cared for in there, but it contains several key papers from Moerkotte et al if you actually want to implement the modern stuff (DPhyp etc.). In general, most of the modern System R-like stuff (how to efficiently deal with outer joins, how to deal with interesting orders, how to deal with large queries) comes from the groups of Moerkotte and/or Neumann; all of these things had solutions before them, but less efficient and/or powerful and/or elegant.

Finding an applicable index isn't hard (you generally just try to see if it hits a predicate—a so-called “sargable predicate”, for SARG = Search ARGument). Estimating selectivity is hard. Estimating selectivity through joins is perhaps the hardest problem in optimizers, and nobody has truly solved it. This was one of the things I never really got to; there are so many hard subproblems. Like, for something that sounds really simple but isn't, what do you do if you have predicates A=x AND B=y AND C=z and you happen to have indexes on (A,B) and (B,C) (with selectivity/cardinality information) and want a selectivity for all three combined? There are papers that literally require you to build a “second-order cone programming” solver to solve this problem :-)

[+] mrkeen|2 years ago|reply

I just tried adding 'relational algebra' in front of my 'query planning' query, and at a glance it skews more towards implementation.

[+] malfist|2 years ago|reply

It seems content farms, farming keywords with just fluff has taken over google. Can't find how to do anything anymore, just pages and pages of people talking about doing something.

You could try your query on a different search engine. I've had good luck with kagi.

[+] skywhopper|2 years ago|reply

It’s out of date and probably has less detail than I remember but I got a lot out of “Inside Microsoft SQL Server 7.0” which does deep dives into storage architecture, indices, query planning etc from an MSSQL specific POV. The book was updated for SQL Server 2003 and 2008. Obviously the book also has a ton of stuff about MS specific features and admin functionality that’s not really relevant, and I’m sure there are better resources out there, but I’ve found the background from that book has helped me understand the innards of Oracle, MySQL, and Postgres in the years since.

[+] dattl|2 years ago|reply

I find the Advanced Databases Course from CMU an excellent resource. https://15721.courses.cs.cmu.edu/spring2023/schedule.html

You might want to look into academic papers, e.g., T. Neumann, Efficiently Compiling Efficient Query Plans for Modern Hardware, in VLDB, 2011 https://www.vldb.org/pvldb/vol4/p539-neumann.pdf

[+] maxdemarzi|2 years ago|reply

The 14th way is “multi way joins” also called “worst case optimal joins” which is a terrible name.

It means instead of joining tables two at a time and dealing with the temporary results along the way (eating memory), you join 3 or more tables together without the temporary results.

There is a blog post and short video of this on https://relational.ai/blog/dovetail-join and the original paper is on https://dl.acm.org/doi/pdf/10.1145/3180143

I work for RelationalAI, we and about 4 other new database companies are bringing these new join algorithms to market after ten years in academia.

[+] gavinray|2 years ago|reply

Justin also has a post on WCOJ that's really solid:

https://justinjaffray.com/a-gentle-ish-introduction-to-worst...

[+] namibj|2 years ago|reply

Negating inputs (set complement) turns the join's `AND` into a `NOR`, as Tetris exploits.

The worst case bounds don't tighten over (stateless/streaming) WCOJ's, but much real world data has far smaller box certificates.

One thing I didn't see is whether Dovetail join allows recursive queries (i.e., arbitrary datalog with a designated output relation, and the user having no concern about what the engine does with all the intermediate relations mentioned in the bundle of horn clauses that make up this datalog query).

Do you happen to know if it supports such queries?

[+] ttfkam|2 years ago|reply

Excellent! We need more articles like this that demonstrate the subtleties of the relational model to primarily app-level developers. The explanations and explorations in terms of functional programming are both concise and compelling.

[+] srcreigh|2 years ago|reply

Another missed chance to educate on the N+1 area. Join on unclustered index is still N+1, it’s just N+1 on disk instead of N+1 over the network and disk.

[+] tourist2d|2 years ago|reply

"Another missed chance to talk about X problem I find interest in and would bloat the article"

[+] smif|2 years ago|reply

An inner join is a Cartesian product with conditionals added.

[+] RobinL|2 years ago|reply

There's a big performance difference between creating the Cartesian product and then filtering for the conditional, and creating the conditionals directly. An inner join with equi join conditions creates the conditions directly; any non equi join conditions actually have to be evaluated

[+] conor-23|2 years ago|reply

This dudes blog is fire. Very nice explanations of complex database topics.

[+] gavinray|2 years ago|reply

Justin Jaffray is a gem

[+] charles_f|2 years ago|reply

Very nice explanation!

> The correct way to do this is to normalize the table

This is true for transactional dbs, but in data warehouses it's widely accepted that some degree of denormalization is the way to go

[+] tmpfile|2 years ago|reply

Great article. Side note: his normalization example reminded me how I used to design tables using a numeric primary key thinking they were more performant than strings. But then I’d have a meaningless id which required a join to get the unique value I actually wanted. One day I realized I could use the same unique key in both tables and save a join.

Simple realization. Big payoff

[+] roselan|2 years ago|reply

I still like to have a unique id field per table. It helps logging and it doesn't care about multi fields "real" key.

However I keep an unique index on the string value and more importantly point integrity constraints to it, mainly for readability. It's way easier to read a table full of meaningful strings rather than full of numerical id or uuids.

[+] Anon4Now|2 years ago|reply

A couple things:

This is admittedly a bit pedantic, but in E.F. Codd's original paper, he defined "relation" as the relation between a tuple (table row) and an attribute (table column) in a single table - https://en.wikipedia.org/wiki/Relation_(database). I'm not sure of the author's intent, but the user table example (user, country_id ) might imply the relationship between the user table and the country table. It's a common misconception about what "relational" means, but tbh I'm fine with that since it makes more sense to the average developer.

If you ever need to join sets of data in code, don't use nested loops - O(n^2). Use a map / dictionary. It's one of the few things I picked up doing Leetcode problems that I've actually needed to apply irl.

[+] abtinf|2 years ago|reply

I think understanding that “relation” means the relationships of attributes to a primary key is a crucial, foundational concept. Without it, you can’t properly understand a concept like normalization, why it arises, and when it applies. You can’t even fully grasp the scope of “data corruption” (most developers have an extremely poor understanding of how broad the notion of corruption is in databases).

[+] zzleeper|2 years ago|reply

I think it depends on the data.

If it's presorted by the join variable then rolling the loop is faster. Also, if the index is too big for memory, then it might be faster to loop.

[+] fuy|2 years ago|reply

I think the blog author uses it the same way Codd does, that's why he's talking about two columns from the same table.

[+] benjiweber|2 years ago|reply

Joins as a relational AND.

https://benjiweber.co.uk/blog/2021/03/21/thinking-in-questio....

[+] boredemployee|2 years ago|reply

I always thought venn diagram was a good representation but I think I was wrong.

Edit: Why did I get down voted? :)

[+] ericHosick|2 years ago|reply

I would also like to know why you are getting downvoted.

Even wikipedia uses a Venn diagram to explain JOIN https://en.wikipedia.org/wiki/Join_(SQL) .

Not trying to use an argument from authority but just pointing out that this is not unheard of.

[+] spiffytech|2 years ago|reply

I've always found Coding Horror's Venn diagram depiction of SQL joins quite enlightening:

https://blog.codinghorror.com/a-visual-explanation-of-sql-jo...

[+] somat|2 years ago|reply

Because a venn diagram does not describe join mechanics well. A venn diagram does however describes the UNION, INTERSECT, EXCEPT part of sql.

https://blog.jooq.org/say-no-to-venn-diagrams-when-explainin...

And more meta, it is an innocent slightly incorrect statement, stuff like that should not be down voted, reply with a correction. Save down votes for outright malicious posts.

[+] TechBro8615|2 years ago|reply

This is a nice way of explaining a concept, and could probably be applied to any complex topic. As someone who learns best by analogy, concepts usually "click" for me once I've associated them with a few other concepts. So I appreciate the format, and would personally enjoy seeing more explainers like this (and not just about database topics).

[+] waynecochran|2 years ago|reply

Can join also be thought of as unification? Similar to type checking.

https://www.cs.cornell.edu/courses/cs3110/2011sp/Lectures/le...

[+] BoppreH|2 years ago|reply

Fantastic post, I always enjoy reading about different computation mental models.

And under "A join is a…join", there's a typo in the partial order properties. It currently reads:

    1. Reflexivity: a≤b,

And I'm pretty sure it should be

    1. Reflexivity: a≤a,

instead (i.e., every element is ≤ to itself).

[+] foldU|2 years ago|reply

You are correct! I will fix it when I get home, thank you for the correction!

[+] lopatin|2 years ago|reply

Very useful. Something like this but for Flink’s fancy temporal, lateral, interval, and window joins would be great too.

[+] danbruc|2 years ago|reply

   1. cross join
   2. natural join
   3. equi join
   4. theta join
   5. inner join
   6. left outer join
   7. right outer join
   8. full outer join
   9. left semi join
  10. right semi join
  11. left anti semi join
  12. right anti semi join
  13. ???

76 comments