Let's talk about joins

[+] jarym|2 years ago|reply

> “Here we typically expect that the combined dataset will have the same number of rows as our original left side dataset.”

For left join this isn’t entirely true. If there are more matching cases on the right joined table then you’ll get additional rows for each match. That is unless you take steps to ensure only at most one row is matched per row on the left (eg using something like DISTINCT ON in Postgres)

[+] petalmind|2 years ago|reply

This is my pet peeve. Top google search results and ChatGPT suggest this "same number of rows" mental model which is incomplete and breaks.

I wrote about this: https://minimalmodeling.substack.com/p/many-explanations-of-...

[+] wood_spirit|2 years ago|reply

Yes I picked that up and was tempted to comment. But then half way down it addresses this with the section “Many relationships”:

> Until now we have discussed scenarios that are considered one-to-one merges. In these cases, we only expect one participant in a dataset to be joined to one instance of that same participant in the other dataset.

> However, there are scenarios where this will not be the case…

[+] contravariant|2 years ago|reply

The hidden requirement is that you need to join on a key, in an ideal world this would be the primary key of the right table and a foreign key in the left table.

Of course this requirement isn't ever enforced because the real world isn't kind enough to give strictly modelled data. It would simplify the query language a lot though.

[+] zX41ZdbW|2 years ago|reply

ClickHouse has support for this as an "ANY" JOIN modifier: https://clickhouse.com/docs/en/sql-reference/statements/sele...

[+] andy81|2 years ago|reply

If I was going to add anything -

"Vertical join" is normally called union.

"Right join" is just (bad) syntax sugar and should be avoided. Left join with the tables reversed is the usual convention.

The join condition for inner is optional- if it's always true then you get a "cross join". Can be useful to show all the possible combinations of two fields.

[+] Little_Kitty|2 years ago|reply

Of the tens of thousands of queries I've written I've needed right join the exactly once. It's a feature which is neat in that it exists, but the prevalence in teaching materials is entirely unjustified. Cross joins are massively more practical and enable some efficient transformations, but are usually taught only as all to all without a clear position on why they are useful.

[+] recursive|2 years ago|reply

How would you rewrite a left join followed by a right join? I don't think right joins are always sugar.

[+] wood_spirit|2 years ago|reply

All the other non-left joins are just syntactic sugar and can be expressed using only left join…?

[+] HermitX|2 years ago|reply

Excellent learning material, thanks for sharing. I've noticed an interesting trend: JOINS are crucial for data analytics, yet many new open-source data analytics products, or open-source OLAP products, offer limited support for JOINS. Examples include ClickHouse, Apache Druid, and Apache Pinot. It seems that currently, only Trino and StarRocks provide robust support for JOINS. Commercial products tend to have better support in this area. I hope the various open-source projects will also enhance their JOIN capabilities.

[+] closeparen|2 years ago|reply

If you want arbitrarily powerful adhoc query support, you need to wait for data to land in an offline warehouse or lake environment where you have access to e.g. Presto/Trino and Spark. If you want a near-real-time view then you’re going to need to design the data layout around your query pattern - do a streaming join or other enrichment prior to OLAP ingestion.

[+] totalhack|2 years ago|reply

I've also been frustrated when testing out tools that kinda keep you locked into one predetermined view, table, or set of tables at a time. I made a semantic data modeling library that puts together queries (and of course joins) for you as it uses a drill-across querying technique, and can also join data across different data sources in a secondary execution layer.

https://github.com/totalhack/zillion

Disclaimer: this project is currently a one man show, though I use it in production at my own company.

[+] minitoar|2 years ago|reply

The reason those tools have more limited support for joins is mainly because they are making intentional trade offs in favor of other features, eg performance in a particular domain.

[+] teunispeters|2 years ago|reply

A related note, PostgreSQL is very good at joins, but MySQL - with at the time much larger share - was never very good at them (at the time). (I last explored this 2016 and before). But a lot of web interfaces to data exploration (then) were based on MySQL and its quirks, and that colours perspectives a lot.

[+] antoniojtorres|2 years ago|reply

I’m curious about the limitations you’ve encountered with ClickHouse’s JOINS, I’ve found it sufficiently robust for dealing with the typical operations

[+] jethkl|2 years ago|reply

The article omits the "natural join". A natural join automatically joins tables on columns having the same name, and the columns in the resulting output appear exactly once. Natural joins often simplify queries, and they automatically adapt to schema changes in source tables. Natural joins also encourage standardized naming conventions, which has longer-term benefits.

[+] cldellow|2 years ago|reply

Natural joins also automatically break your queries when two columns happen to share a name but not the same meaning.

Step 1: use natural join. Life is great. Step 2: someone adds a `comment` field on table A. Life is great. Step 3: someone adds a `comment` field on table B. Ruh roh.

I'll use them in short-lived personal projects, but not on something where I'm collaborating with other people on software that evolves over several years.

[+] youerbt|2 years ago|reply

I feel like natural joins simplify writing queries, but not exactly reading them (especially if you are not familiar with the database). IMO a good compromise is the USING clause, which acts like a natural join, but columns have to be named explicitly.

[+] civilized|2 years ago|reply

> Natural joins also encourage standardized naming conventions, which has longer-term benefits.

This is a very positive spin on "you have to manage column names very rigorously for this strategy to be sustainable".

[+] hipadev23|2 years ago|reply

That’s not a join type, that’s syntactic sugar.

[+] layer8|2 years ago|reply

> the columns in the resulting output appear exactly once

No, records with the same join-column value are multiplied.

[+] genman|2 years ago|reply

Good material.

Joins can be also categorized by the used join algorithm.

The simplest join is a nested loop join. In this case a larger dataset is iterated and another small dataset is combined using a simple key lookup.

Then there is a merge join. In this case two larger datasets are first sorted using merge sort and then aligned linearly.

Then there is a hash join. For the hash join a hash table is generated first based on the smaller dataset and the larger dataset is iterated and the join is made by making a hash lookup to the generated hash table.

The difference between nested loop join and hash join might be confusing.

In case of a nested loop join the second table is not loaded from the storage medium first, instead an index is used to lookup the location of the records. This has O(log n) complexity for each lookup. For hash join the table is loaded and hash table is generated. In this case each lookup has O(1) complexity but creation of hash table is expensive (it has O(n) complexity) and is only worth the cost when the dataset is relatively large.

[+] tmoertel|2 years ago|reply

In distributed systems, there's even a broadcast hash join, in which the hash table is not distributed across shards and assigned to workers but copied in full to every worker. That way, the other side of the join need not be sharded and shuffled to align join keys across workers. This strategy can save a lot of time and network bandwidth when one side of the join is small enough to fit into a worker's RAM and the other side is staggeringly massive (e.g., logs). It lets the massive side be processed in place, as it's streamed from storage.

[+] mgaunard|2 years ago|reply

There are lots of other joins not mentioned there.

A popular one with time series is the asof join.

There are also literal joins, which are generalizations of adjacent difference.

[+] mgaunard|2 years ago|reply

meant lAteral sorry

[+] zzzeek|2 years ago|reply

when I saw the term "horizontal joins", I immediately went to, what? what's a "vertical join?" must be a UNION, scrolled down and sure enough.

Is there some reason to use non-standard terminology for posts that are trying to be in-depth, authoritative tutorials ?

[+] nikhilsimha|2 years ago|reply

Never heard a union be called a “vertical join” before.

[+] petalmind|2 years ago|reply

It seems that lots of people independently coin this expression. I did it too a couple of years ago. Just checked Google, and there are lots of hits for this.

[+] layer8|2 years ago|reply

It’s like calling a sum an additive product. Doesn’t really make a lot of sense.

[+] frogamel|2 years ago|reply

Dplyr actually supports some really cool join functionalities that I wish were in SQL implementations, including:

- Ability to specify whether your join should be one-to-one, many-to-one, etc. So that R will throw an error instead of quietly returning 100x as many rows as expected (which I've seen a lot in SQL pipelines).

- A direct anti_join function. Much cleaner than using LEFT JOIN... WHERE b IS NULL to replicate an anti join.

- Support for rolling joins. E.g. for each user, get the price of their last transaction. Super common but can be a pain in SQL since it requires nested subqueries or CTEs.

[+] sa-code|2 years ago|reply

Polars does this too!

[+] holoduke|2 years ago|reply

Its funny that the older i get 40 the more it seems like my surrounding doesn't even understand the basics of sql. In my 20s everyone knew sql. It was required for almost every job. And it is still very useful knowledge. Difference between left, right and inner joins is ultra basic. If you dont know that i would be very worried.

[+] dan-robertson|2 years ago|reply

This article is targeted at people using R to analyse data which is probably more as-hoc and not in a sql database. I don’t think the target audience (might include eg scientists who are working with experiment results) should be expected to know SQL, and even if they do, it’s still useful to understand the dplyr functions,

[+] brodo|2 years ago|reply

I see more and more people coming out of college (or through apprenticeships) with miserable SQL skills. Relational databases are boring technology, and people want to work with shiny new things. People don't want to learn it, and they rely on ORMs so they don't have to. The result is miserable performance and reimplementing parts of SQL badly in the application itself.

[+] airstrike|2 years ago|reply

> Let’s try this again using R.

This should actually be "Let's try this again using dplyr, one of the most elegant pieces of software ever written".

Hadley Wickham is a treasure to humanity.

[+] tosh|2 years ago|reply

asof join:

https://code.kx.com/q/ref/asof/

https://code.kx.com/q/learn/brief-introduction/#time-joins

https://clickhouse.com/docs/en/sql-reference/statements/sele...

https://duckdb.org/docs/guides/sql_features/asof_join.html

> Do you have time series data that you want to join, but the timestamps don’t quite match? Or do you want to look up a value that changes over time using the times in another table?

DuckDB blog post on temporal joins:

https://duckdb.org/2023/09/15/asof-joins-fuzzy-temporal-look...

note: this idea is very useful for timestamps but can also be used for other types of values

[+] jamesblonde|2 years ago|reply

The ASOF LEFT JOIN is the main join used to create training data, where the labels are in the left table, and features are in the tables on the RHS. Note, we use DuckDB for its left asof join capabilities, but most vendors use Spark which is hideously slow for left asof joins.

[+] adrianmonk|2 years ago|reply

From the opening of the "Vertical Joins" section:

> Similar to horizontal joins, there are many use cases for joining data horizontally, also called appending data.

Shouldn't this read "joining data vertically"? This seems like a typo.

[+] cghlewis|2 years ago|reply

This is definitely a typo! Thanks for catching this!

[+] unknown|2 years ago|reply

[deleted]

[+] lakomen|2 years ago|reply

Beginners talking about SQL joins. What is this newbienews or hackernews? "But but ChatGPT says"...

Disgusting

[+] brodo|2 years ago|reply

I like reading well-written articles about basic things, even as a senior dev. My job involves mentoring people, and it is always good to have some articles at hand to give them to read if they struggle with a topic. Moreover, it is interesting to see how the concepts are thought. I like to give people several explanations for the same thing in the hope that at least one sticks.

76 comments