Once I started thinking about joins in terms of spatial dimensions, things got a lot easier to reason about. I like to think of an inner join like the scene from Stargate where they describe how gate addressing works. Assume you have a contrived example with 3 tables describing the position of an off-world probe in each dimension - X, Y, and Z. Each table looks like:
CREATE TABLE Dim_X (
  EntityId int,
  Value float
)
Establishing a point in space for a given entity (regardless of how you derive that ID) is then a matter of:
SELECT Dim_X.Value AS X, Dim_Y.Value AS Y, Dim_Z.Value AS Z
FROM Dim_X, Dim_Y, Dim_Z
WHERE Dim_X.EntityId = Dim_Y.EntityId --Join 1
AND Dim_Y.EntityId = Dim_Z.EntityId --Join 2
AND Dim_X.EntityId = @MyEntityId --The entity we want to find the 3D location of
You will note that there are 2 inner joins used in this example. That is the bare minimum needed to construct a 3 dimensional space. Think about taking 3 cards and taping them together with 2 pieces of rigid metal tape. You can make a good corner of a cube, even if it's a bit floppy on one edge. Gravity doesn't apply inside the RDBMS, so this works out.
This same reasoning can be extrapolated to higher, non-spatial dimensions. Think about adding time into that mix. In order to find an entity, you also now need to perform an inner join on that dimension and constrain on a specific time value. If you join but fail to constrain on a specific time, then you get a happy, comprehensive report of all the places an entity was over time.
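To make that concrete, here's a runnable sketch using SQLite from Python. The Dim_* tables follow the contrived example upthread; the added T column and all data values are assumptions for illustration only:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Dim_X (EntityId INT, T INT, Value FLOAT);
CREATE TABLE Dim_Y (EntityId INT, T INT, Value FLOAT);
CREATE TABLE Dim_Z (EntityId INT, T INT, Value FLOAT);
INSERT INTO Dim_X VALUES (1, 0, 10.0), (1, 1, 11.0);
INSERT INTO Dim_Y VALUES (1, 0, 20.0), (1, 1, 21.0);
INSERT INTO Dim_Z VALUES (1, 0, 30.0), (1, 1, 31.0);
""")

# Joined on entity AND time, constrained to one time value: a single point.
point = con.execute("""
    SELECT x.Value, y.Value, z.Value
    FROM Dim_X x, Dim_Y y, Dim_Z z
    WHERE x.EntityId = y.EntityId AND y.EntityId = z.EntityId
      AND x.T = y.T AND y.T = z.T          -- the extra "time" joins
      AND x.EntityId = 1 AND x.T = 1       -- pin down entity and time
""").fetchall()

# Same joins, but no time constraint: the entity's full trajectory.
trajectory = con.execute("""
    SELECT x.T, x.Value, y.Value, z.Value
    FROM Dim_X x, Dim_Y y, Dim_Z z
    WHERE x.EntityId = y.EntityId AND y.EntityId = z.EntityId
      AND x.T = y.T AND y.T = z.T
      AND x.EntityId = 1
    ORDER BY x.T
""").fetchall()

print(point)        # [(11.0, 21.0, 31.0)]
print(trajectory)   # [(0, 10.0, 20.0, 30.0), (1, 11.0, 21.0, 31.0)]
```

Dropping the `x.T = 1` constraint is exactly the "happy, comprehensive report of all the places an entity was over time."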
The other join types are really minor variations on these themes once you have a deep conceptual grasp (i.e. can "rotate" the schema in your mind).
Playing around with some toy examples can do wonders for understanding. I sometimes find myself going back to the cartesian coordinate example when stuck trying to weld together 10+ dimensions in a real-world business situation.
Your example raises a question I've never found a good answer to:
Why use `JOIN` (and `INNER JOIN` and friends) at all when you can write the exact join more accurately IMO with WHERE clauses (and using "*=" operators), and then just list all the tables in the FROM clause?
My brain always hurts trying to parse complex FROM clauses with various JOINs whereas reading the equivalent equations in WHERE clauses is way more straightforward to me.
https://en.m.wikipedia.org/wiki/Relational_algebra
("The result of the natural join [R ⋈ S] is the set of all combinations of tuples in R and S that are equal on their common attribute names... it is the relational counterpart of the logical AND operator.")
⋈ amounts to a Cartesian product with a predicate that throws out the rows that shouldn't be in the result. Lots of SQL makes sense if you think of joins this way.
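That mental model is a one-liner in code. A minimal sketch, assuming relations represented as non-empty lists of dicts sharing one attribute name:

```python
from itertools import product

# Two toy relations with the common attribute "id" (data made up).
R = [{"id": 1, "x": "a"}, {"id": 2, "x": "b"}]
S = [{"id": 1, "y": "p"}, {"id": 3, "y": "q"}]

# Common attribute names, read off the first tuple of each relation.
shared = [k for k in R[0] if k in S[0]]

natural_join = [
    {**r, **s}
    for r, s in product(R, S)             # the Cartesian product...
    if all(r[k] == s[k] for k in shared)  # ...with the predicate (the AND)
]
print(natural_join)  # [{'id': 1, 'x': 'a', 'y': 'p'}]
```

Real engines never materialize the full product, of course, but the result is defined as if they did.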
For several days I've been having trouble finding good resources on _implementation_ of query execution/planning (predicates, existing indices <<especially composite ones - how to locate them during planning, etc.>>, joins, etc.).
Google is spammed with _usage_.
Does anybody have some recommendations at hand?
PS: the only one I found was CMU's Database Group resources, which are great.
Generally, this is a specialist topic, so you're unlikely to find e.g. good textbooks. It depends a bit on how deep you want to go and what specific parts you're interested in, though. (Query execution is pretty much entirely separate from planning, for one.)
The best join optimizer paper I know of is still the original Selinger paper (https://courses.cs.duke.edu/compsci516/cps216/spring03/paper...). It doesn't support outer joins and there are more efficient techniques by now, but anyone looking at a System R-like optimizer can read this and feel right at home. (There is also the Volcano/Cascades school of optimizers, which is rather different from System R, but I've found even less information on it.)
As others have said, Postgres' source and documentation is good. In particular, src/backend/optimizer/README contains a number of interesting things that you won't find a lot of other places. (The Postgres optimizer doesn't always use the latest fancy designs, but it's generally very polished and pretty easy to read.)
I can also echo Andy Pavlo's courses (at CMU), they're pretty much the only ones who will explain this stuff to you online. The “Building Query Compilers” PDF is rather incomplete and there's a lot of stuff I never cared for in there, but it contains several key papers from Moerkotte et al if you actually want to implement the modern stuff (DPhyp etc.). In general, most of the modern System R-like stuff (how to efficiently deal with outer joins, how to deal with interesting orders, how to deal with large queries) comes from the groups of Moerkotte and/or Neumann; all of these things had solutions before them, but less efficient and/or powerful and/or elegant.
Finding an applicable index isn't hard (you generally just try to see if it hits a predicate—a so-called “sargable predicate”, for SARG = Search ARGument). Estimating selectivity is hard. Estimating selectivity through joins is perhaps the hardest problem in optimizers, and nobody has truly solved it. This was one of the things I never really got to; there are so many hard subproblems. Like, for something that sounds really simple but isn't, what do you do if you have predicates A=x AND B=y AND C=z and you happen to have indexes on (A,B) and (B,C) (with selectivity/cardinality information) and want a selectivity for all three combined? There are papers that literally require you to build a “second-order cone programming” solver to solve this problem :-)
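As a toy illustration of the (A,B)/(B,C) subproblem above: one common back-of-envelope heuristic (not what the papers mentioned do, and all numbers below are made up) is to assume A and C are conditionally independent given B:

```python
# Toy numbers standing in for statistics an optimizer would read from
# composite indexes on (A,B) and (B,C). All values here are invented.
sel_ab = 0.01   # selectivity of A=x AND B=y, from the (A,B) index
sel_bc = 0.02   # selectivity of B=y AND C=z, from the (B,C) index
sel_b = 0.10    # selectivity of B=y alone

# Assume A and C are conditionally independent given B:
# P(A,B,C) = P(A,B) * P(C|B) = P(A,B) * P(B,C) / P(B)
sel_abc = sel_ab * sel_bc / sel_b      # ~0.002

# The fully naive alternative multiplies single-column selectivities,
# throwing away the correlation the composite indexes captured:
sel_a, sel_c = 0.05, 0.08              # hypothetical per-column stats
naive = sel_a * sel_b * sel_c          # ~0.0004, possibly a big underestimate
```

Both estimates can be badly wrong on correlated data, which is exactly why the literature reaches for much heavier machinery.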
It seems content farms, farming keywords with just fluff, have taken over Google. I can't find how to do anything anymore, just pages and pages of people talking about doing something.
You could try your query on a different search engine. I've had good luck with kagi.
It’s out of date and probably has less detail than I remember, but I got a lot out of “Inside Microsoft SQL Server 7.0”, which does deep dives into storage architecture, indices, query planning, etc. from an MSSQL-specific POV. The book was updated for SQL Server 2000 and 2005. Obviously the book also has a ton of stuff about MS-specific features and admin functionality that’s not really relevant, and I’m sure there are better resources out there, but I’ve found the background from that book has helped me understand the innards of Oracle, MySQL, and Postgres in the years since.
The 14th way is “multi-way joins”, also called “worst-case optimal joins”, which is a terrible name.
It means instead of joining tables two at a time and dealing with the temporary results along the way (eating memory), you join 3 or more tables together without the temporary results.
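This is not RelationalAI's actual algorithm, just a toy sketch of the general idea (a "generic join" on the classic triangle query R(a,b) ⋈ S(b,c) ⋈ T(a,c)), with made-up data:

```python
from collections import defaultdict

# Hypothetical binary relations as sets of pairs.
R = {(1, 2), (1, 3), (2, 3)}   # R(a, b)
S = {(2, 3), (3, 1), (3, 4)}   # S(b, c)
T = {(1, 3)}                   # T(a, c)

# Index each relation by the variable we'll bind first.
R_by_a = defaultdict(set)
for a, b in R:
    R_by_a[a].add(b)
S_by_b = defaultdict(set)
for b, c in S:
    S_by_b[b].add(c)
T_by_a = defaultdict(set)
for a, c in T:
    T_by_a[a].add(c)

# Bind one variable at a time across ALL relations that mention it,
# intersecting as we go. No R-join-S intermediate result is ever built,
# which is the whole point: that intermediate can dwarf the final answer.
triangles = []
for a in sorted(R_by_a.keys() & T_by_a.keys()):
    for b in sorted(R_by_a[a] & S_by_b.keys()):
        for c in sorted(S_by_b[b] & T_by_a[a]):
            triangles.append((a, b, c))
print(triangles)  # [(1, 2, 3)]
```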
Negating inputs (set complement) turns the join's `AND` into a `NOR`, as the Tetris algorithm exploits.
The worst-case bounds don't tighten over (stateless/streaming) WCOJs, but much real-world data has far smaller box certificates.
One thing I didn't see is whether Dovetail join allows recursive queries (i.e., arbitrary datalog with a designated output relation, and the user having no concern about what the engine does with all the intermediate relations mentioned in the bundle of horn clauses that make up this datalog query).
Do you happen to know if it supports such queries?
Excellent! We need more articles like this that demonstrate the subtleties of the relational model to primarily app-level developers. The explanations and explorations in terms of functional programming are both concise and compelling.
Another missed chance to educate on the N+1 area. A join on an unclustered index is still N+1; it's just N+1 on disk instead of N+1 over the network and disk.
There's a big performance difference between creating the Cartesian product and then filtering on the join condition, versus producing the matching rows directly. An inner join with equi-join conditions can produce the matches directly; any non-equi-join conditions actually have to be evaluated against candidate rows.
Great article. Side note: his normalization example reminded me how I used to design tables using a numeric primary key thinking they were more performant than strings. But then I’d have a meaningless id which required a join to get the unique value I actually wanted. One day I realized I could use the same unique key in both tables and save a join.
I still like to have a unique id field per table. It helps logging, and it doesn't depend on a multi-field "real" key.
However, I keep a unique index on the string value and, more importantly, point integrity constraints to it, mainly for readability. It's way easier to read a table full of meaningful strings than one full of numeric ids or UUIDs.
This is admittedly a bit pedantic, but in E.F. Codd's original paper, he defined "relation" as the relation between a tuple (table row) and an attribute (table column) in a single table - https://en.wikipedia.org/wiki/Relation_(database). I'm not sure of the author's intent, but the user table example (user, country_id ) might imply the relationship between the user table and the country table. It's a common misconception about what "relational" means, but tbh I'm fine with that since it makes more sense to the average developer.
If you ever need to join sets of data in code, don't use nested loops - O(n^2). Use a map / dictionary. It's one of the few things I picked up doing Leetcode problems that I've actually needed to apply irl.
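For example, a minimal hash-join-style sketch in Python (data made up for illustration):

```python
from collections import defaultdict

# Two sets of records to join on a key.
users = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bob"}]
orders = [{"user_id": 1, "total": 9.5}, {"user_id": 1, "total": 3.0},
          {"user_id": 2, "total": 7.0}]

# Build a map over one side once (O(n))...
orders_by_user = defaultdict(list)
for o in orders:
    orders_by_user[o["user_id"]].append(o)

# ...then probe it once per row on the other side (O(m)),
# instead of an O(n*m) nested loop.
joined = [
    (u["name"], o["total"])
    for u in users
    for o in orders_by_user[u["id"]]
]
print(joined)  # [('Ada', 9.5), ('Ada', 3.0), ('Bob', 7.0)]
```

This is essentially what a database's hash join does; note it only works for equality conditions on the key.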
I think understanding that “relation” means the relationships of attributes to a primary key is a crucial, foundational concept. Without it, you can’t properly understand a concept like normalization, why it arises, and when it applies. You can’t even fully grasp the scope of “data corruption” (most developers have an extremely poor understanding of how broad the notion of corruption is in databases).
And more meta: it is an innocent, slightly incorrect statement; stuff like that should not be downvoted - reply with a correction instead. Save downvotes for outright malicious posts.
This is a nice way of explaining a concept, and could probably be applied to any complex topic. As someone who learns best by analogy, concepts usually "click" for me once I've associated them with a few other concepts. So I appreciate the format, and would personally enjoy seeing more explainers like this (and not just about database topics).
"Building Query Compilers"
https://pi3.informatik.uni-mannheim.de/~moer/querycompiler.p...
EDIT: You also might get use out of TUM's course "Database Systems on Modern CPU Architectures"
The year that contains all the video lectures is 2020:
https://db.in.tum.de/teaching/ss20/moderndbs/?lang=en
https://www.postgresql.org/docs/current/planner-optimizer.ht...
You might want to look into academic papers, e.g., T. Neumann, Efficiently Compiling Efficient Query Plans for Modern Hardware, in VLDB, 2011 https://www.vldb.org/pvldb/vol4/p539-neumann.pdf
There is a blog post and short video of this on https://relational.ai/blog/dovetail-join and the original paper is on https://dl.acm.org/doi/pdf/10.1145/3180143
I work for RelationalAI, we and about 4 other new database companies are bringing these new join algorithms to market after ten years in academia.
https://justinjaffray.com/a-gentle-ish-introduction-to-worst...
> The correct way to do this is to normalize the table
This is true for transactional dbs, but in data warehouses it's widely accepted that some degree of denormalization is the way to go
Simple realization. Big payoff
If it's presorted by the join variable, then rolling the loop is faster. Also, if the index is too big for memory, then it might be faster to loop.
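The presorted case is essentially a merge join. A minimal sketch, assuming both sides are already sorted by the join key and keys are unique per side:

```python
# Two inputs sorted by join key: (key, payload) pairs (data made up).
left = [(1, "a"), (3, "b"), (5, "c")]
right = [(3, "x"), (4, "y"), (5, "z")]

# Walk both sides once with two cursors; no hash table, O(n + m).
i = j = 0
merged = []
while i < len(left) and j < len(right):
    lk, rk = left[i][0], right[j][0]
    if lk == rk:
        merged.append((lk, left[i][1], right[j][1]))
        i += 1
        j += 1
    elif lk < rk:
        i += 1      # advance whichever cursor is behind
    else:
        j += 1
print(merged)  # [(3, 'b', 'x'), (5, 'c', 'z')]
```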
https://benjiweber.co.uk/blog/2021/03/21/thinking-in-questio....
Even Wikipedia uses a Venn diagram to explain JOIN: https://en.wikipedia.org/wiki/Join_(SQL)
Not trying to make an argument from authority, just pointing out that this is not unheard of.
https://blog.codinghorror.com/a-visual-explanation-of-sql-jo...
https://blog.jooq.org/say-no-to-venn-diagrams-when-explainin...
https://www.cs.cornell.edu/courses/cs3110/2011sp/Lectures/le...
And under "A join is a…join", there's a typo in the partial order properties. I'm pretty sure the reflexivity property should read a ≤ a (i.e., every element is ≤ to itself).