jiggawatts|6 years ago
This query is just a key lookup, so the response time is likely to be well under 10ms even if you're not trying very hard to be efficient.
If using a stored procedure or a prepared query with persistent connections, it'll most likely be under 1ms if the data is stored on SSDs or in-memory. In this binary format, you need about 6 GB.
For some scenarios, such as small cloud web servers, fast storage or lots of RAM may not be viable. If you're storing this on mechanical drives, the random distribution of hashes means that for practically every query the DB engine will be walking uncached B-Tree pages, and random I/O seek latency will dominate.
Some quick back-of-the-envelope maths: you can fit roughly 300 hashes into a typical 8KB page (MS SQL Server's page size, to pick a random example). That means it'll build a 4-level B-Tree index for the HIBP database.
Unless you have less than 1 GB of memory, the first two index levels will stay cached, leaving two random seeks per lookup. At a typical 3-5ms per seek, that's about 6-10ms of disk I/O latency, which will likely dwarf all other overheads.
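The envelope maths above can be sketched out explicitly. The constants here are assumptions consistent with the figures in the comment: ~300 index entries per 8KB page, and ~300M 20-byte SHA-1 hashes (which is what the ~6 GB binary figure implies):

```python
import math

PAGE_FANOUT = 300          # ~8KB page / ~27 bytes per index entry (assumption)
NUM_HASHES = 300_000_000   # ~6 GB of 20-byte SHA-1 hashes (assumption)

# B-Tree depth needed to index every hash
levels = math.ceil(math.log(NUM_HASHES, PAGE_FANOUT))   # 4 levels

# The top two levels are tiny, so they stay cached even on a small box:
top_two_pages = 1 + PAGE_FANOUT                  # root + second level
top_two_mb = top_two_pages * 8192 / 2**20        # ~2.4 MB

# Each remaining uncached level costs one random seek
seeks = levels - 2                               # 2 seeks
lo_ms, hi_ms = seeks * 3, seeks * 5              # 6-10ms per lookup
```

Note the two cached levels together are only a couple of megabytes, which is why the "unless you have less than 1 GB" caveat is so generous.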
Now keep in mind that the OP was trying to optimise a workflow that took 33 seconds originally!
Using a proper database has a ton of other benefits. For example, it becomes trivial to fit into modern asynchronous web application programming frameworks as yet another async call. Literally 1 line of code using Dapper or something along those lines.
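For a rough idea of what that "one async call" looks like, here's the analogous shape in Python rather than C#/Dapper (hedged: `conn.fetchval`, the `pwned_hashes` table, and the `hash` column are all hypothetical names, following the style of an asyncpg-like driver):

```python
import asyncio
import hashlib

async def is_pwned(conn, password: str) -> bool:
    """Single indexed key lookup against a hypothetical pwned_hashes table."""
    digest = hashlib.sha1(password.encode()).digest()
    # The whole check is one awaited query -- it slots into an async
    # request handler like any other call.
    return await conn.fetchval(
        "SELECT EXISTS(SELECT 1 FROM pwned_hashes WHERE hash = $1)", digest
    )
```

The point isn't the specific driver; it's that the DB engine does the B-Tree walk for you, and the application side stays a single awaitable.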