top | item 37327764

thargor90 | 2 years ago

That's weird. I thought this would be optimized based on database statistics and not depend on the query syntax. At least for those simple queries.

Can anyone elaborate why this is the case?

ZeroClickOk | 2 years ago

From the article:

"For the OR clause query, the DBMS sequentially compares each condition's value one by one. In the given example, it performs three comparisons for each row to determine if (price = 1), (price = 2), and then (price = 3). This evaluation approach means that for N predicates, the complexity of the filtering operation per row is O(N).

On the other hand, with the IN clause, PostgreSQL builds a temporary hash table populated with the elements in the query’s IN clause (ExecEvalHashedScalarArrayOp). Then as the DBMS scans each tuple, it probes this hash table to see whether the tuple's attribute matches with any entry. This hash-based evaluation has a more efficient complexity of O(1) since the DBMS only needs to perform one lookup in the hash table per tuple.

Hence, as the number of predicates increases in a query, the OR clause incurs significant overhead because the DBMS evaluates predicates for each row individually, in contrast to the more efficient IN clause."
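The quoted explanation can be illustrated with a small Python sketch. This is not PostgreSQL's actual executor code; the function names and the dict-based rows are invented for illustration. The point is only the complexity difference: the OR-style filter compares each row against every predicate value, while the IN-style filter builds a hash table once and does a single probe per row.

```python
rows = [{"price": p} for p in range(1, 8)]
targets = [1, 2, 3]

def filter_with_or(rows, targets):
    # OR-style evaluation: compare each predicate value one by one,
    # so the per-row filtering cost is O(N) in the number of predicates.
    out = []
    for row in rows:
        if any(row["price"] == t for t in targets):
            out.append(row)
    return out

def filter_with_in(rows, targets):
    # IN-style evaluation: build a hash table once (analogous to the
    # ExecEvalHashedScalarArrayOp hash table described above), then do
    # a single O(1) probe per row.
    lookup = set(targets)
    return [row for row in rows if row["price"] in lookup]

assert filter_with_or(rows, targets) == filter_with_in(rows, targets)
```

Both filters return the same rows; only the per-row work differs, which is why the gap grows with the number of predicates.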

thargor90 | 2 years ago

Yes, this explains the how, but not the why.

Why is there no optimization in place for this? Converting a = x OR a = y OR a = z to a IN (x, y, z) should be trivial, and the DB should have heuristics that estimate the expected query cost to decide when to apply this transformation.
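The rewrite being proposed can be sketched in a few lines of Python. This is a toy illustration only, not how PostgreSQL's planner represents or transforms predicates; the function name and the (column, value) tuple representation are invented. It shows the one safety check such a rewrite needs: every OR arm must be an equality on the same column.

```python
def rewrite_or_to_in(predicates):
    # predicates: equality predicates joined by OR, given as (column, value)
    # pairs, e.g. [("a", "x"), ("a", "y"), ("a", "z")].
    columns = {col for col, _ in predicates}
    if len(columns) == 1:
        # All arms test the same column, so the OR chain is equivalent
        # to a single IN list.
        col = columns.pop()
        values = ", ".join(repr(v) for _, v in predicates)
        return f"{col} IN ({values})"
    # Mixed columns: leave the OR chain as written.
    return " OR ".join(f"{col} = {v!r}" for col, v in predicates)

print(rewrite_or_to_in([("a", "x"), ("a", "y"), ("a", "z")]))
# prints: a IN ('x', 'y', 'z')
```

A real planner would also have to handle NULL semantics, non-equality operators, and type coercion before applying such a rewrite, which is part of why "trivial" transformations still carry engineering cost.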