j-pb | 1 month ago
There is too much heuristic fiddling involved, and way too many niche algorithms that get cobbled together with an optimiser.
It's as if we're missing the theory to actually solve the problem, so instead we hobble along by covering as many corner cases as we can, completely missing some elegant and profound beauty.
mamcx|1 month ago
Attempt to implement them manually and you'll see how hard it is.
PLUS, not only do you need to account for the general solution, but also for what could be best given the current data set.
And you can't compile the plan statically (dynamically, sure).
And it should work interactively, so hopefully the planning is faster than running the actual query.
P.S.: Joins are normally the focus, but other constructs are also challenging. For example, just deciding whether and which indexes to pick can be challenging when you have dozens of predicates.
And yes, your optimizer should (eventually!) survive being fed hundreds of joins, predicates, aggregates, sorts, and arbitrary expressions.
* I worked on the optimizer of a database. EVERYTHING is tricky!
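The blowup with "hundreds of joins" is easy to quantify: the number of distinct binary join trees over n relations is n! leaf orderings times a Catalan number of tree shapes (a standard counting result). A quick sketch, with a hypothetical helper name:

```python
from math import comb, factorial

def join_orderings(n: int) -> int:
    """Distinct binary (bushy) join trees over n relations:
    n! leaf orderings times the (n-1)-th Catalan number of shapes."""
    catalan = comb(2 * (n - 1), n - 1) // n  # C_{n-1}
    return factorial(n) * catalan

# Already at 8 relations there are over 17 million plans to consider.
for n in (2, 4, 8, 12):
    print(n, join_orderings(n))
```

This is why real optimizers fall back on dynamic programming, heuristics, or randomized search instead of exhaustive enumeration.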
akoboldfrying|1 month ago
So it's possible that the full optimisation problem over all join tree shapes is "easy", even though an output-constrained version of it is NP-hard... But I think that's unlikely. Having an NP-hard constrained variant like this strongly suggests that the original problem is itself NP-hard, and I suspect this could be shown by some other reduction.
akoboldfrying|1 month ago
> with selectivity a function of only the two immediate child nodes in the join
This should be "with selectivity a function of the rightmost leaves of the two child subtrees", so that it still makes sense for general ("bushy") join trees. (I know, I'm talking to myself... But I needed to write this down to convince myself that the original unconstrained problem wasn't just the (very easy) minimum spanning tree problem in disguise.)
sethev|1 month ago
For example, there's a class of join algorithms called 'worst-case optimal' - which is not a great name, but basically means that these algorithms run in time proportional to the worst-case output size. These algorithms ditch the two-at-a-time approach that databases typically use and join multiple relations at the same time.
One example is the leapfrog trie join which was part of the LogicBlox database.
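The core of the leapfrog idea, per join attribute, is a k-way intersection of sorted iterators that repeatedly seeks every input to the largest key seen so far. A rough single-attribute sketch (assumes integer keys and duplicate-free sorted lists; the real algorithm gallops over trie iterators, for which `bisect` stands in here):

```python
from bisect import bisect_left

def leapfrog_intersect(sorted_lists):
    """Intersect k sorted, duplicate-free integer lists by repeatedly
    seeking each list to the largest key seen so far (the leapfrog idea)."""
    if not sorted_lists or any(not lst for lst in sorted_lists):
        return []
    pos = [0] * len(sorted_lists)
    out = []
    hi = max(lst[0] for lst in sorted_lists)
    while True:
        matched = True
        for i, lst in enumerate(sorted_lists):
            pos[i] = bisect_left(lst, hi, pos[i])  # seek forward to >= hi
            if pos[i] == len(lst):
                return out  # one input exhausted: no more matches possible
            if lst[pos[i]] != hi:
                hi = lst[pos[i]]  # leapfrog: everyone must catch up to this key
                matched = False
        if matched:
            out.append(hi)
            hi += 1  # advance past the match (integer keys assumed)
```

Because each list only ever moves forward, the work is bounded by the smallest input's size (times log factors), which is where the worst-case-optimal bounds come from.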
mihailstoian|1 month ago
1. LpBound: join size upper bounds. It still doesn't have full SQL coverage, e.g., string predicates, window functions, subqueries etc., but as with all cool stuff, it takes time to build it.
2. xBound: join size lower bounds. We showed how to do it at least for multi-way joins on the same join key, e.g., many subexpressions of the JOB benchmark have this shape. Still open how to do the rest - I'd say even harder than for upper bounds! (NB: I'm an author.)
[1] LpBound: https://arxiv.org/abs/2502.05912
[2] xBound: https://arxiv.org/abs/2601.13117
jasonwatkinspdx|1 month ago
maweki|1 month ago
I was also missing a mention of "sideways information passing", though some of the methods are exactly that.
I am wondering whether the company consults literature or whether they fiddle about, mostly reinventing the wheel.
Sesse__|1 month ago
The really hard problem is estimating the cost of each plan once you've generated it, which necessarily must happen by some sort of heuristics combined with statistics. In particular: If you want to join A, B and C, the cost of (A JOIN B) JOIN C versus A JOIN (B JOIN C) can differ by many orders of magnitude, depending on the size of the intermediate products. And both the cost effects and any misestimation tend to compound through the plan as you add more tables and predicates.
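A toy illustration of that compounding, under the usual independence assumption and with entirely made-up cardinalities and selectivities:

```python
# Toy cardinality model: |R JOIN S| = |R| * |S| * selectivity.
# The numbers are hypothetical, chosen only to show how the size of
# the intermediate result drives the cost of each plan.
card = {"A": 1_000_000, "B": 1_000_000, "C": 10}
sel = {("A", "B"): 1e-6, ("B", "C"): 1e-3}  # no predicate links A and C

def join_size(left_rows, right_rows, pair_sel):
    return left_rows * right_rows * pair_sel

# Plan 1: (A JOIN B) JOIN C -- must materialize A JOIN B first
plan1_intermediate = join_size(card["A"], card["B"], sel[("A", "B")])

# Plan 2: A JOIN (B JOIN C) -- must materialize B JOIN C first
plan2_intermediate = join_size(card["B"], card["C"], sel[("B", "C")])

# The intermediates differ by two orders of magnitude, and a
# misestimate here feeds into every join above it in the plan.
print(plan1_intermediate, plan2_intermediate)
```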
nine_k|1 month ago
Sesse__|1 month ago
There _are_ tons of corner cases that you need to address since there are some super-hard problems in there (in particular, robust cardinality estimation of join outputs is a problem so hard that most of academia barely wants to touch it, despite its huge importance), but it doesn't need to be this bad.
willvarfar|1 month ago
10000truths|1 month ago
Optimize for the p99/p99.9/worst case scenarios. Minimize unpredictability in performance where possible, even if it comes at a small cost of median/average performance. Your SREs will thank you.
dapperdrake|1 month ago
Just look at when SAS programmers are advised to use a merge or a format.
Even the choice between hash join and merge join really depends on your data's cardinality (read: sizes), indices, etc.
EDIT: Other comments also point out that there are non-general joins that are already NP-hard to optimize. You really want all the educated guesses you can get.
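A back-of-the-envelope cost model (the constants are illustrative, not from any real optimizer) shows how cardinalities and pre-existing order can flip the choice:

```python
import math

def hash_join_cost(n, m):
    # Build a hash table on the smaller input (pricier per tuple),
    # then probe it with the larger one. Constants are made up.
    return 2 * min(n, m) + max(n, m)

def merge_join_cost(n, m, presorted=False):
    if presorted:
        return n + m  # just a linear merge scan
    # Otherwise pay to sort both inputs first.
    return n * math.log2(n) + m * math.log2(m) + n + m

big, small = 10_000_000, 1_000
print(hash_join_cost(big, small))                   # hash wins on unsorted inputs
print(merge_join_cost(big, small, presorted=True))  # already-sorted inputs flip it
```

The point isn't the exact constants: it's that the cheaper algorithm depends on statistics (sizes, available sort orders) the optimizer can only estimate.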
imtringued|1 month ago
RaftPeople|1 month ago
Requires some dynamic SQL to construct, but the beauty is that you can use the SQL engine for this solution:
select top 1 *
from (select
        -- unique aliases so that select * on the derived table is valid;
        -- each cross-join row is already one candidate combination,
        -- so plain addition replaces the redundant SUM ... GROUP BY
        t1.id as id1, t2.id as id2, ..., tn.id as idn,
        t1.cost + t2.cost + ... + tn.cost as total_cost
      from join_options t1
      cross join join_options t2
      ...
      cross join join_options tn) t0
order by t0.total_cost