top | item 42490362

(no title)

jschafer | 1 year ago

Note that the measurements in the paper were made before they fixed a bug where they confused bits and bytes. So SQLite only used 1/8 of the reserved bloom filter space, thus increasing the false positive rate significantly:

https://sqlite.org/src/info/56d9bb7aa63043f5

I found and reported the bug because I wanted to know how the bloom filters work in SQLite for my uni seminar paper. Still wondering if one can find those kind of bugs with test cases.

discuss

agilob|1 year ago

On top of that I don't think it's fair to say it's 10x faster when it btree was tested only on integer index primary key column. Benchmarks with that bold statements should include short string (1-16 chars maybe) and UUID indexes at least.

jschafer|1 year ago

I do not know if it is still the case, but the last time I looked into the source code SQLite did hash all strings to the exact same value.

So the bloom filter optimization does not work there.

It had to do with the different ways strings can be compared with collating functions, as strings may be equal even if they have different bytes: https://sqlite.org/forum/forumpost/0846211821

thaumasiotes|1 year ago

Why do you want a UUID index? Use an integer index and have the UUID in another column.

Scene_Cast2|1 year ago

It's also a problem in machine learning. Your data might be mangled due to a bug but the NN will still extract something useful out of it. Or, on the flip side, if you make a change to the data and things do break (learning stops converging), you never really know if it's the architecture or the data that's the issue.

ramraj07|1 year ago

How much of a slowdown did you estimate this bug caused?

jschafer|1 year ago

SQLite only knows nested loop joins and the bloom filter can just tell us "no need to do a join, there is definitely no matching entry".

If it has a false positive all the time (the worst case) then the performance is the same as before the bloom filter optimization was implemented (besides the small bloom filter overhead).

As the bloom filter size in SQLite directly depends on the table size I estimated a false positive rate of 63.2% due to this bug, while it could have been just 11.75%.

metadat|1 year ago

It actually would have performed faster, but the false positive rate drastically increased.

devoutsalsa|1 year ago

90%?