top | item 7271389


pdubs | 12 years ago

Regarding #5:

> That caught me by surprise. Both options “roughly the same” and “depends on the data” got about 25% — the guessing probably.

I don't think it was guessing so much as reasoning that fetching 100 rows (and filtering by value) instead of 10 rows doesn't have significant real-world impact unless the row data is particularly large. I'll admit I didn't think of the out-of-index lookup, but my main thought was 100/1000000 vs 10/1000000 isn't a big deal unless the DB is running on an abacus.
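The exact quiz queries aren't shown in this thread, but the distinction being discussed can be sketched with a hypothetical two-column table in SQLite (table and column names are assumptions, not the quiz's): a filter on an indexed column alone can be answered from the index (a covering, or index-only, scan), while adding a filter on an unindexed column forces a lookup into the table for each matching index entry.

```python
# Sketch of the index-only vs. out-of-index-lookup distinction, using
# SQLite's EXPLAIN QUERY PLAN. Schema and values are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t (a INTEGER, b INTEGER);
    CREATE INDEX t_a ON t (a);
""")

def plan(sql):
    # EXPLAIN QUERY PLAN rows describe each step in their last column.
    return " ".join(row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

# Only column 'a' is needed: SQLite answers from the index alone.
p1 = plan("SELECT count(*) FROM t WHERE a = 123")

# Column 'b' is not in the index: every matching index entry forces a
# visit to the table itself to read 'b'.
p2 = plan("SELECT count(*) FROM t WHERE a = 123 AND b = 42")

print(p1)  # mentions a COVERING INDEX scan
print(p2)  # uses the index, but not as a covering index
```

Whether the extra table visits matter in practice is exactly what the rest of the thread argues about.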


pradocchia | 12 years ago

I also answered "about the same", and no, I didn't notice the bookmark lookup (as it's called on MSSQL). Even if I had, I'm not sure I would have changed my answer. Well, maybe I would have, because I would have noticed the "trick", but putting my test-taking adaptations aside...

It is already a highly selective query. Adding 100 bookmark lookups will not cause a material change in performance unless the query is being executed thousands of times per second, in which case maybe your real problem lies elsewhere.

jcampbell1 | 12 years ago

> It is already a highly selective query.

Is it? The first query is always O(1). The worst case for the latter query is that it must aggregate over 999,910 rows.

Consider the case where every value of 'a' is 123, and every value of 'b' is 42 except in 90 rows.
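The arithmetic behind that worst case is a one-liner. Under the distribution described (1,000,000 rows, a = 123 everywhere, b = 42 everywhere except 90 rows), the filter on 'a' matches every row, and the extra filter on 'b' eliminates almost nothing:

```python
# Worst-case distribution sketched above.
total_rows = 1_000_000
rows_matching_a = total_rows            # a = 123 on every row
rows_matching_both = total_rows - 90    # b differs from 42 in only 90 rows

print(rows_matching_both)  # 999910
```

So "highly selective" depends entirely on the data, which is the point of the worst-case example.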

Guvante | 12 years ago

My thought was akin to what he said: the system needs to grab all 100 rows to do the second filter.

I didn't think about the fact that the original query was an index-only scan, so I went with "Roughly the same", since apart from that property the performance is similar.

Too bad there is no way to get good data on why people thought that way without a much longer quiz.

trhway | 12 years ago

It has a significant impact, because the first case is "take 100 rows from the index", while the second is "take 100 rows from the index _and_, for each row, go to the corresponding row in the table to check the value of 'b'". Each of those table visits is a random IO, and with 1M rows it is probably 1 IO per row.

Such 100 random IOs will cost about 0.5 sec on a single spinning-platter HDD, for example. So the query performance will degrade significantly unless the table is already preloaded into memory or you use SSD drives.
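The 0.5 sec figure follows from an assumed average random-seek time of roughly 5 ms per IO (a typical ballpark for a 7200 rpm disk, not a number given in the thread):

```python
# Back-of-the-envelope check of the 0.5 sec claim above.
seek_time_s = 0.005   # assumed ~5 ms per random IO on a spinning disk
random_ios = 100      # one table lookup per matching index row

total_s = random_ios * seek_time_s
print(total_s)  # 0.5
```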

batbomb | 12 years ago

> Such 100 random IOs will cost 0.5 sec on 1 iron platter HDD for example.

That's an incredibly iffy and mostly wrong statement, resting on what is arguably a corner case that rarely reflects reality (factors include the DBMS, row ordering, table size, cache size, block size, page size, RAM size, hard-disk seek time, and HDD throughput).

The only case where that's likely is performing a very cold query on a very large randomly distributed table once (and probably only once).

Even a table of 1 million rows at ~30 bytes per row could easily be read into memory in about 300ms: one ~5ms seek plus a sequential read at ~100MB/s, i.e. ~5 + (1e6 * rowsize / 100e3) ms.

Query Optimizers do exactly this.
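Plugging the comment's own numbers into its formula (1M rows of ~30 bytes, an assumed ~100 MB/s sequential throughput, one ~5 ms seek) gives the claimed ~300 ms:

```python
# Sanity check of the sequential full-read estimate above.
rows = 1e6
row_size_bytes = 30
throughput_bytes_per_ms = 100e3   # 100 MB/s == 100,000 bytes per ms (assumed)
seek_ms = 5                       # one initial seek (assumed)

total_ms = seek_ms + rows * row_size_bytes / throughput_bytes_per_ms
print(total_ms)  # 305.0
```

Which is why, past some selectivity threshold, scanning the whole table sequentially beats doing thousands of random lookups.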

brianberns | 12 years ago

I had the same thought. Seems like the optimizer could still perform an index-only scan to get to 100 rows, then go to the table to filter them down to 10 rows. Yes, the second step is extra, but should still be fast. What am I missing?

eterm | 12 years ago

That was my reasoning when I answered, but I had missed the fact that the query uses a GROUP BY, which means you can't just filter after the fact.

Edit: In other words, it was 100 vs 10 aggregated rows. An extra WHERE clause will change the values of each of the aggregated rows, rather than just filter the rows from 100 down to 10 (which is what a HAVING clause would do).
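The WHERE-vs-HAVING distinction is easy to demonstrate on a tiny hypothetical table (names and values assumed, SQLite): WHERE removes rows before grouping and so changes the aggregate values themselves, while HAVING drops entire groups after the aggregates are computed.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t (g INTEGER, b INTEGER);
    INSERT INTO t VALUES (1, 42), (1, 7), (2, 42);
""")

# WHERE filters rows before aggregation: group 1 loses its (1, 7) row,
# so its count drops from 2 to 1.
where_rows = con.execute(
    "SELECT g, count(*) FROM t WHERE b = 42 GROUP BY g ORDER BY g").fetchall()

# HAVING filters whole groups after aggregation: counts are computed
# over all rows first, then groups failing the predicate are dropped.
having_rows = con.execute(
    "SELECT g, count(*) FROM t GROUP BY g HAVING count(*) > 1 ORDER BY g").fetchall()

print(where_rows)   # [(1, 1), (2, 1)]
print(having_rows)  # [(1, 2)]
```

This is why "index scan first, filter the 100 rows afterwards" doesn't work here: the aggregates themselves come out different.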

munimkazia | 12 years ago

Yep, that's exactly what I felt too. It was the only question I got wrong, in fact.