top | item 45416835

(no title)

Ameo | 5 months ago

The main takeaway from this for me is that SQLite’s query planner seems to be pretty limited. It’s reliant on stuff like the order in which WHERE conditions are specified, isn’t able to use multiple indexes in queries in many cases, bails out to scans when a variety of different operations show up in queries, etc.

It might be the case that SQLite has a simpler or less sophisticated query planner than other databases like Postgres or MariaDB, but in my experience those DBs struggle a lot with good querying planning as well. I’ve spent many hours in the past with issues like Postgres suddenly starting to ignore an index entirely because its computed table data distribution statistics got out of balance, or having to add manual annotations to MariaDB queries like STRAIGHT_JOIN in order to get a query to run faster.

I’m guessing that this is a really hard problem since it doesn’t seem to be really “solved” by any major DB vendor I’ve seen. A lot of modern DB engines like Clickhouse tend to just work around this problem by being so fast at full table scans that they don’t even need any sophisticated indexing set up at all.

discuss

order

crazygringo|5 months ago

> The main takeaway from this for me is that SQLite’s query planner seems to be pretty limited.

This doesn't appear to be true at all.

The order of WHERE conditions does not matter; the order of columns in an index does.

Everything you're describing is pretty much just how indexes fundamentally work in all databases. Which is why you're saying it hasn't been "solved" by anyone.

Indexes aren't magic -- if you understand how they work as a tree, it becomes very clear what can be optimized and what can't.

It is true that occasionally query planners get it wrong, but it's also often the case that your query was written in a non-obvious way that is equivalent in terms of its formal results, but is not idiomatic -- and making it more idiomatic means the query planner can more easily understand which indexes to use where.

Ameo|5 months ago

(copying my reply from the other comment that said the same thing as you)

The order of conditions in a WHERE definitely does matter, especially in cases where the conditions are on non-indexed columns or there are CPU-intensive search operations like regex, string ops, etc.

I just ran this test locally with a table I created that has 50 million rows:

``` » time sqlite3 test.db "select count() from test WHERE a != 'a' AND a != 'b' AND a != 'c' AND a != 'd' AND b != 'c' AND d != 'd' AND e != 'f' AND f = 'g'" sqlite3 test.db 5.50s user 0.72s system 99% cpu 6.225 total » time sqlite3 test.db "select count() from test WHERE f = 'g' AND a != 'a' AND a != 'b' AND a != 'c' AND a != 'd' AND b != 'c' AND d != 'd' AND e != 'f'" sqlite3 test.db 1.51s user 0.72s system 99% cpu 2.231 total ```

The only difference is swapping the `f = 'g'` condition from last to first. That condition never matches in this query, so it's able to fail fast and skip all of the work of checking the other conditions.

SkiFire13|5 months ago

> A lot of modern DB engines like Clickhouse tend to just work around this problem by being so fast at full table scans that they don’t even need any sophisticated indexing set up at all.

There's only so much you can do with this approach due how to the algorithmic complexity scales as more joins are added. At some points you'll need some additional data structures to speed things up, though they not be indexes in name (e.g. materialized views)

setr|5 months ago

Clickhouse isn’t fast at table scans, it’s just columnar. Indexes are basically a maintained transform from row storage to column storage; columnar databases are essentially already “indexed” by their nature (and they auto-apply some additional indexes on top, like zone maps). It’s only fast for table-scans in the sense that you probably aren’t doing a select * from table, so it’s only iterating over a few columns of data, whereas SQLite would end up iterating over literally everything — a table-scan doesn’t really mean the same thing between the two (a columnar database’s worst fear is selecting every column; a row-base database wants to avoid selecting every row)

Their problem is instead that getting back to a row, even within a table, is essentially a join. Which is why they fundamentally suck at point lookups, and they strongly favor analytic queries that largely work column-wise.

emschwartz|5 months ago

Just to clarify one thing: the order of WHERE conditions in a query does not matter. The order of columns in an index does.

Ameo|5 months ago

It definitely does matter, especially in cases where the conditions are on non-indexed columns or there are CPU-intensive search operations like regex, string ops, etc.

I just ran this test locally with a table I created that has 50 million rows:

``` » time sqlite3 test.db "select count() from test WHERE a != 'a' AND a != 'b' AND a != 'c' AND a != 'd' AND b != 'c' AND d != 'd' AND e != 'f' AND f = 'g'" sqlite3 test.db 5.50s user 0.72s system 99% cpu 6.225 total » time sqlite3 test.db "select count() from test WHERE f = 'g' AND a != 'a' AND a != 'b' AND a != 'c' AND a != 'd' AND b != 'c' AND d != 'd' AND e != 'f'" sqlite3 test.db 1.51s user 0.72s system 99% cpu 2.231 total ```

The only difference is swapping the `f = 'g'` condition from last to first. That condition never matches in this query, so it's able to fail fast and skip all of the work of checking the other conditions.