item 1860095

Naggum Nugget

55 points | signa11 | 15 years ago | groups.google.com

54 comments

[+] kmavm|15 years ago|reply
True enough in 1999, and all but completely false now. For a competently implemented hash map with integer keys, hash tables start beating linear search somewhere between n = 2 and n = 4.

At n = 4, you start needing more than one cacheline to do your linear search. Since a worst-case cache miss is around 300 cycles on a Nehalem, a hash function is a very, very cheap price to pay for near-certainty that you'll only probe one line of the structure.

All of which makes the table-pounding tone of this post kind of funny from 11 years' hindsight. The crossover point between linear search and hashing is a result of hardware and software parameters, not rhetorical force. He was right when he wrote what he wrote, and is wrong now. Future hardware/software changes might make him right again, but might not. I've indulged in the "real programmers vs. quiche eaters" rhetorical flourish at times, too; this post is a welcome reminder of its risks. It looks better to posterity if you do the experiments, state the results, and walk away.
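The claimed crossover is at least easy to probe at interpreter level. A hypothetical harness (the function name and sizes are mine, not from the thread); absolute numbers and the break-even point will differ by machine and by language, so treat the output as illustrative rather than a settled result:

```python
import timeit

# Sketch of the measurement, not a claim about the result: where does a
# dict lookup overtake a linear scan over an integer-keyed list of pairs?
def crossover_table(sizes=(2, 4, 8, 16, 64)):
    results = {}
    for n in sizes:
        pairs = [(i * 7, i) for i in range(n)]   # integer keys -> values
        d = dict(pairs)
        key = pairs[-1][0]                       # worst case for the scan
        scan = timeit.timeit(
            lambda: next(v for k, v in pairs if k == key), number=10_000)
        probe = timeit.timeit(lambda: d[key], number=10_000)
        results[n] = (scan, probe)
    return results
```

Each entry pairs the linear-scan time with the hash-probe time for that n; the crossover is wherever the second number starts winning on your machine.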

[+] hvs|15 years ago|reply
And based on the way he worded it, he would agree with you now. His point was that you need to apply both theory AND practice and not just blindly cite theory.
[+] nkurz|15 years ago|reply
"Completely false" might be an overstatement. Do you have any test results you could point to? Or at least an exact specification of what the test would consist of? In particular, are you measuring the first run from a lukewarm start, guaranteed to be in memory but not in cache? Or might you be reading a page or two from disk, or making multiple passes while it still might be in some level of cache? Because I think the details are going to make a difference as to whether the break-even point is n = 4, n = 400, or maybe even n = 4000.

Like the decade-old Erik, my intuition is still that brute force would be hard to beat for fewer than 100 integer keys mapping to pointers if one is doing multiple passes through the list. But I'd love to bring my intuition up to date. Do you have one of these new-fangled 'competently implemented hash maps' that you could yield up so I can do some cycle counting? And is it fair if we throw in a few insertions and deletions just to make it realistic?

For the record, I completely agree on the irrelevance of the cycles to compute the hash once one might be hitting main memory. But I also think it's worth noting that the size of your L3 cache may be comparable to all the RAM that Erik had in his machine. I also think you might be underestimating the power of a modern prefetch unit, whether hinted or automatic. But I'll bow to data. As you said very well: "do the experiments, state the results, and walk away."
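One possible pinning-down of the experiment asked for here (names and constants are mine): feed the same probe sequence to a list scan and to a dict, timed cache-warm over repeated operations. Python timings only gesture at the C-level question, and the cold-start and disk-paging cases would need a lower-level harness:

```python
import random
import timeit

# A mixed-lookup workload over n integer keys, identical for both
# structures. Extending it with a few inserts and deletes, as the
# comment suggests, is a straightforward variation.
def run_workload(n, ops=1_000, seed=0):
    rng = random.Random(seed)
    keys = list(range(n))
    probes = [rng.choice(keys) for _ in range(ops)]  # same probes for both

    def with_list():
        table = [(k, k) for k in keys]
        return sum(next(v for kk, v in table if kk == k) for k in probes)

    def with_dict():
        table = {k: k for k in keys}
        return sum(table[k] for k in probes)

    assert with_list() == with_dict()  # same answers before we time anything
    return (timeit.timeit(with_list, number=1),
            timeit.timeit(with_dict, number=1))
```

Sweeping n and comparing the two times is exactly the "state the results and walk away" experiment.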

[+] malkia|15 years ago|reply
You are ignoring the fact that you have to calculate the hash-key. With association tables, pointer comparison is enough if you are searching for interned symbols.
[+] bborud|15 years ago|reply
I don't see why this should look funny 11 years in hindsight. the important part is knowing rather than assuming. and you seem to assume quite a lot.
[+] StavrosK|15 years ago|reply
TL;DR: O(2n^2) complexity is better than O(5000n) for small n.

I believe the above to be an accurate summary.
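Taking the joke constants literally (they are rhetorical, not measured), the arithmetic puts the crossover at n = 2500, since 2n^2 < 5000n exactly when n < 2500:

```python
# Hypothetical cost model from the comment above: brute force ~2n^2,
# hashing ~5000n. Brute force wins while 2n^2 < 5000n, i.e. n < 2500.
def cheaper(n):
    linear, hashed = 2 * n * n, 5000 * n
    return "linear" if linear < hashed else "hash"
```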

[+] lisper|15 years ago|reply
In all the times I have seen this topic raised (and I've seen it raised a lot) I have yet to see anyone who engages in the debate actually do the experiment to determine where the actual breakeven point is. Hint: it's less than 5000.
[+] unwind|15 years ago|reply
... and using upper-case characters at the start of sentences is just silly.
[+] davidw|15 years ago|reply
I don't see him pin any actual numbers to things. It'd be interesting to see how things work out with different sized lists and hash tables.
[+] JoachimSchipper|15 years ago|reply
Pretty much, although O() notation doesn't really work with constants.
[+] exit|15 years ago|reply
another thing the world needs is people who can argue a point without an obnoxious "shut-down" mentality about it.

whatever points he's making, it feels like listening to Bill O'Reilly make them.

[+] zachbeane|15 years ago|reply
Obtaining useful information in contexts or from people you find unpleasant is a handy skill.
[+] gregfjohnson|15 years ago|reply
In order to improve the performance of hash tables, why not have the buckets be something like a red-black tree instead of a linked list? (You might improve the efficiency a little by spending a bit and saying "This bucket contains a single element and here it is", or "This bucket contains multiple elements and here is the root of the red-black tree".) Then, you would get log-time search instead of linear-time search when buckets start to fill up with multiple elements.
[+] Psyonic|15 years ago|reply
This exists; it's just generally not done, because your buckets should be small enough that linear search is fastest.

From the wikipedia hash table article (http://en.wikipedia.org/wiki/Hash_table#Separate_chaining_wi...):

Instead of a list, one can use any other data structure that supports the required operations. By using a self-balancing tree, for example, the theoretical worst-case time of a hash table can be brought down to O(log n) rather than O(n). However, this approach is only worth the trouble and extra memory cost if long delays must be avoided at all costs (e.g. in a real-time application), or if one expects to have many entries hashed to the same slot (e.g. if one expects extremely non-uniform or even malicious key distributions).
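A minimal sketch of the treeified-bucket idea, including the "one bit" flag described above. Python has no stdlib red-black tree, so a `bisect`-maintained sorted list stands in (O(log n) search, though O(n) insert); the class and threshold names are mine. Java 8's `HashMap` does a real version of this, converting a bin to a tree once it grows past a threshold:

```python
import bisect

TREEIFY_AT = 8  # bucket length at which we abandon linear scanning

class ChainedMap:
    def __init__(self, nbuckets=16):
        self.buckets = [[] for _ in range(nbuckets)]  # small buckets: plain lists
        self.sorted = [False] * nbuckets              # the per-bucket "one bit"

    def _bucket(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        i = self._bucket(key)
        b = self.buckets[i]
        if self.sorted[i]:
            # Tree-ish path: binary search on (key,) finds the slot.
            j = bisect.bisect_left(b, (key,))
            if j < len(b) and b[j][0] == key:
                b[j] = (key, value)
            else:
                b.insert(j, (key, value))
            return
        for j, (k, _) in enumerate(b):                # linear path
            if k == key:
                b[j] = (key, value)
                return
        b.append((key, value))
        if len(b) > TREEIFY_AT:                       # flip to the sorted form
            b.sort()
            self.sorted[i] = True

    def get(self, key, default=None):
        i = self._bucket(key)
        b = self.buckets[i]
        if self.sorted[i]:
            j = bisect.bisect_left(b, (key,))
            if j < len(b) and b[j][0] == key:
                return b[j][1]
            return default
        for k, v in b:
            if k == key:
                return v
        return default
```

The flip only ever pays off for pathologically full buckets, which is the Wikipedia passage's point.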

[+] JabavuAdams|15 years ago|reply
Ouch! The cruel tutelage of Erik Naggum.

I miss that guy.

[+] mdemare|15 years ago|reply
tl;dr: Fancy algorithms are slow when n is small, and n is usually small. (Rob Pike).