top | item 34182803

subhajeet2107|3 years ago

> Joins are cheap. Normalise earlier

I am sure I have heard strong opinions like this, but in practice joins are never cheap. Tables with billions of rows crossed with millions of rows just to find a single row of data is not something I would call cheap. More often than not it is better to avoid joining large tables if you can live with duplicate data. One trick I have found works better is to archive the data in your tables that is not accessed frequently, shrinking the tables down to only the data that is needed.

_a_a_a_|3 years ago

aaand here we go again.

DB guy with 25+ years experience. Summary: it depends.

> joins are never cheap

It depends. On table size, on indexes, and on how expensive the alternative is. Always!

> tables with billions of rows crossed with millions of rows just to find a single row with data is not something i would call cheap

indexes

> more often than not it is better to avoid joining large tables if you can live with duplicate data

1E9 x 1E6 = 1E15 (at worst anyway). A join via an index will save you colossal amounts of IO (though as ever, it depends).
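The scan-vs-index difference is easy to see in a query plan. A minimal sketch using Python's sqlite3 (the tables and index name are made up for illustration):

```python
import sqlite3

# Hypothetical schema: a big table joined to a small one on big.small_id.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE big   (id INTEGER PRIMARY KEY, small_id INTEGER);
    CREATE TABLE small (id INTEGER PRIMARY KEY, val TEXT);
""")

q = "SELECT s.val FROM small s JOIN big b ON b.small_id = s.id WHERE s.id = 7"

# Plan before the index: the planner has no choice but to scan `big`.
before = [row[3] for row in con.execute("EXPLAIN QUERY PLAN " + q)]
con.execute("CREATE INDEX idx_big_small ON big (small_id)")
# Plan after the index: `big` is probed via idx_big_small instead of scanned.
after = [row[3] for row in con.execute("EXPLAIN QUERY PLAN " + q)]

print(any("SCAN" in d for d in before))          # True: full scan of big
print(any("idx_big_small" in d for d in after))  # True: index search of big
```

On a billion-row table, that plan change is the difference between touching every page and touching a handful.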

Problem here isn't this mostly clueless advice (discarding/archiving unnecessary data is the only good idea here, and it's not used as often as it should be). Problem is strong opinions put forth by someone who doesn't have the necessary experience, or understanding of what's going on under the hood. Denormalising is a useful tool that IME rarely gains you more than it loses you, but this 'advice' is just going to lead people down the wrong alley, and I'm tired of suchlike n00b advice strongly (and incorrectly and arrogantly) expressed on HN.

(edited to fix maths error)

Akronymus|3 years ago

There's also the possibility of filtering each source table first, then doing an inner join. Which can VASTLY cut down on computation. I assume GP assumed doing an outer join first, then filtering.

But those are details for the database engine to handle. And, as you said: indexes.
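Filter-then-join and join-then-filter produce the same rows for an inner join, which is exactly what lets the engine pick whichever order is cheaper. A small sketch with Python's sqlite3 (tables `a` and `b` are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, grp INTEGER);
    CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER, grp INTEGER);
""")
con.executemany("INSERT INTO a VALUES (?, ?)", [(i, i % 10) for i in range(100)])
con.executemany("INSERT INTO b VALUES (?, ?, ?)",
                [(i, i % 100, i % 10) for i in range(500)])

# Join everything, then filter ...
q1 = con.execute("""
    SELECT a.id, b.id FROM a JOIN b ON b.a_id = a.id
    WHERE a.grp = 3 AND b.grp = 3
    ORDER BY 1, 2
""").fetchall()

# ... vs. filter each side first, then join the much smaller results.
q2 = con.execute("""
    SELECT fa.id, fb.id
    FROM (SELECT id FROM a WHERE grp = 3) fa
    JOIN (SELECT id, a_id FROM b WHERE grp = 3) fb ON fb.a_id = fa.id
    ORDER BY 1, 2
""").fetchall()

print(q1 == q2)  # True: same rows either way
```

Because the results are identical, a decent planner pushes the predicates down below the join on its own.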

jteppinette|3 years ago

Wow, this comment comes across as being incredibly arrogant while providing zero value. nOOb lol

bazoom42|3 years ago

Joins are cheaper than the alternatives.

If you have billions of rows you need to search through for a single row, it will be costly unless you have the appropriate indexes. Storing data in JSON fields or similar is only going to make it slower.

If you are only selecting a single row, a join will be instantaneous. If you are joining many rows on either side it has a cost, but so does denormalized data, since you just need to scan through that many more rows. Sure, in certain specific cases denormalization can be a valid optimization, but in the general case it will just make queries slower.
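The single-row case can be sketched with Python's sqlite3 (schema and names invented for illustration): with an index on the join key, fetching one user's data through a join is an index lookup, not a scan.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    CREATE INDEX idx_orders_user ON orders (user_id);
""")
con.executemany("INSERT INTO users VALUES (?, ?)",
                [(i, f"u{i}") for i in range(1000)])
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(i, i % 1000, float(i)) for i in range(10000)])

# One user's cheapest order: the planner resolves the join via idx_orders_user,
# so only that user's order rows are ever touched.
row = con.execute("""
    SELECT u.name, o.total FROM users u
    JOIN orders o ON o.user_id = u.id
    WHERE u.id = 42
    ORDER BY o.total LIMIT 1
""").fetchone()
print(row)  # ('u42', 42.0)
```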

Saying joins are cheap or expensive only makes sense when comparing to the alternative.

uuvs8|3 years ago

> If you are only selecting a single row, a join will be instantaneous.

Exactly. It's not like the join materializes explicitly in memory.

This has all been known for decades since the beginning of relational databases. That's why taking a DB class is valuable, or at least reading a good DB fundamentals book.

joshka|3 years ago

Came here to say the same :D

But to steel man this a little: joins can be cheap if you understand how they work and what the patterns of use are. I've worked on systems where that billions-of-rows-to-millions scenario is a thing. I recall a particularly nasty legacy query underpinning a report that took most of a day to finish; I managed to get it down to tens of seconds just by tweaking how it joined.

wfriesen|3 years ago

Joins are cheap when joining on indexed columns, but the trade-off is extra time maintaining those indexes when writing. As always, it depends on the use case.

marginalia_nu|3 years ago

Depends a lot on the size of the tables, but that is true for anything to do with databases. It's probably hard to give one-size-fits-all advice about database management, since you're ultimately balancing conflicting interests (query performance, maintainability, disk space).

With a sufficiently small database, even non-indexed joins may appear fast.

If your database has billions of rows or more, then even indexed joins will need to be used judiciously, especially if you have a lot of indexes. The indexes will probably also become very large on disk (possibly hundreds of GB), and they will degrade performance, since the system will more than likely struggle to keep them in RAM.

smackeyacky|3 years ago

Joins can be cheap if you keep half an eye on the query optimiser. Just make sure that the filters only grab what is necessary from your huge tables before the join part starts and a join can be blazingly fast on most modern databases.

vore|3 years ago

It really depends on where you put the cost in the end: do you pay it on the query side, or on the side of manually managing data integrity?

Even if you don't denormalize, there are plenty of optimizations for joins: e.g. a bitmap join index can optimize a millions x billions join pretty well!

uuvs8|3 years ago

> tables with billions of rows crossed with millions of rows just to find a single row with data is not something i would call cheap,

Um... yes? It's cheap if you get your indices right. A hash table doesn't really care whether there are 20 or 20B entries.

TylerE|3 years ago

Normalize your DATA.

Materialize views for high-performance querying.
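SQLite (used here via Python's sqlite3 just for a runnable sketch) has no native materialized views, so this approximates one with a table rebuilt from the normalized source; Postgres and Oracle support `CREATE MATERIALIZED VIEW` directly. All names are hypothetical:

```python
import sqlite3

# Normalized source tables.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales    (id INTEGER PRIMARY KEY, product_id INTEGER, qty INTEGER);
""")
con.execute("INSERT INTO products VALUES (1, 'widget')")
con.executemany("INSERT INTO sales VALUES (?, 1, ?)", [(1, 5), (2, 7)])

# "Refresh" step: precompute the join/aggregate once so reads are table lookups.
con.executescript("""
    DROP TABLE IF EXISTS sales_by_product;
    CREATE TABLE sales_by_product AS
        SELECT p.name, SUM(s.qty) AS total_qty
        FROM products p JOIN sales s ON s.product_id = p.id
        GROUP BY p.id;
""")
totals = con.execute("SELECT * FROM sales_by_product").fetchall()
print(totals)  # [('widget', 12)]
```

The source stays normalized; only the derived read model carries the duplication, and it can be rebuilt at will.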

uwuemu|3 years ago

[deleted]

aaomidi|3 years ago

No, joins on single row to single row with indices are cheap - which is what the article is saying.

You can also use views to keep the joins in place so you don’t need to keep joining.
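A plain view names the join once, so callers query it like a table; the join still executes per query unless it is materialized, but nobody has to retype it. A sketch with Python's sqlite3 (hypothetical schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts   (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    -- The join is written once, here, and reused by every caller.
    CREATE VIEW posts_with_author AS
        SELECT p.id, p.title, a.name AS author
        FROM posts p JOIN authors a ON a.id = p.author_id;
""")
con.execute("INSERT INTO authors VALUES (1, 'alice')")
con.execute("INSERT INTO posts VALUES (10, 1, 'hello')")

rows = con.execute("SELECT author, title FROM posts_with_author").fetchall()
print(rows)  # [('alice', 'hello')]
```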