top | item 14633010

Diving into the world of hash tables

160 points| majikarp | 8 years ago |zeroequalsfalse.press | reply

100 comments

[+] VHRanger|8 years ago|reply
If you care about performance, it's important to know the underlying representation of a dictionary data structure.

For instance, the std::unordered_map in C++ is based around buckets implemented as linked lists of key/value pairs. This means iterating through the keys or values is very slow (lots of cache misses!), but insertion is fast. Retrieving a key is O(logn) but a very slow O(logn), because of the cache misses.

Other implementations keep a sorted vector of keys and a corresponding vector of values. Loki::AssocVector and boost::flat_map do this, for instance. Now insertion is slow, but iteration and retrieval are very fast (a fast O(logn) by binary search with few cache misses). It's also memory efficient, since there are no pointers between elements.

If you have one big dictionary you would use throughout an application, and know the data up front, a good strategy is to reserve the necessary memory in an array, fill it with keys/values, sort it once over the keys and coindex the value array. Now you have a memory efficient and extremely fast dictionary data structure.

One other strategy any intermediate coder can implement is to have two unsorted coindexed arrays. You don't even need a hash function for this. Now iteration through and insertion into the table are extremely fast, and it is memory efficient, but finding a key is a linear scan, just a fast O(n). So this is good for smaller tables. In C++ you could implement it as a std::pair<std::vector<Key>, std::vector<Value>>. If you need a quick small map in a function, this is often the fastest data structure you can implement without too many headaches.
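A minimal Python sketch of the "sort once, coindex the values" idea (the class name is made up for illustration; the C++ equivalents are the flat-map types mentioned above):

```python
import bisect

class FlatMap:
    """Toy sorted coindexed-array dictionary (illustrative, not a library API)."""

    def __init__(self, pairs):
        # Build once up front: sort the pairs by key, then split ("coindex")
        # into a key array and a value array that share indices.
        pairs = sorted(pairs)
        self.keys = [k for k, _ in pairs]
        self.values = [v for _, v in pairs]

    def get(self, key):
        # O(log n) binary search over a contiguous array: few cache misses.
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        raise KeyError(key)

m = FlatMap([("b", 2), ("a", 1), ("c", 3)])
m.get("a")  # -> 1
```

The point is that all the cost is paid once, at build time; lookups afterward touch only two contiguous arrays.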

[+] tjalfi|8 years ago|reply
Chandler Carruth talks about unordered_map performance[1] in his CppCon 2014 talk "Efficiency with Algorithms, Performance with Data Structures". He endorses most of the other alternatives you mentioned.

[1] https://youtu.be/fHNmRkzxHWs?t=46m39s

[+] obstinate|8 years ago|reply
> sorted vector of keys . . . reserve necessary memory and sort it . . . unsorted coindexed arrays . . .

Most of the things you mentioned are not hash tables, but members of a parent concept, dictionaries. Hash tables by definition involve some sort of hashing of the key. The two main categories of hash table are chained hash tables (std::unordered_map does this, at least in the implementations I'm aware of) and open-addressed hash tables, which use probing instead of secondary data structures to resolve collisions.
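For concreteness, here's a toy open-addressed table with linear probing in Python (illustrative only; it assumes the table never fills up, whereas a real implementation would resize):

```python
class ProbingTable:
    """Toy open-addressed hash table with linear probing (no resizing)."""

    def __init__(self, capacity=8):
        self.slots = [None] * capacity  # each slot holds (key, value) or None

    def _index(self, key):
        i = hash(key) % len(self.slots)
        # Linear probe: step to the next slot until we find the key or a hole.
        # Assumes at least one empty slot always remains.
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)
        return i

    def put(self, key, value):
        self.slots[self._index(key)] = (key, value)

    def get(self, key):
        slot = self.slots[self._index(key)]
        if slot is None:
            raise KeyError(key)
        return slot[1]
```

A collision just means scanning forward in the same flat array, which is exactly the cache-friendly behavior chained tables give up.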

[+] Eridrus|8 years ago|reply
It's an interesting fiction that we tell everyone that hash tables are O(1), when in reality hash operations are typically O(n) in the key size, which is O(log n) in the value space. It's just much cheaper to have your O(log n) operations be reading sequential bits, rather than following pointers, reading whole keys, etc.

Probably not something people usually run into, but it does show that constants matter.

[+] ramchip|8 years ago|reply
> a good strategy is to reserve the necessary memory in an array, fill it with keys/values, sort it once over the keys and coindex the value array

This explanation confuses me. Is there one or two arrays? What does "coindex" mean?

[+] YZF|8 years ago|reply
You mean insertion into a bucket when there's a collision? And your n is the number of entries in the list? But n should always be very small relative to the total number of items in the hash map such that overall insertion and lookups are O(1).
[+] panic|8 years ago|reply
> For instance, the std::unordered_map in C++ is based around buckets implemented as linked lists of key/value pairs. This means iterating through the keys or values is very slow (lots of cache misses!), but insertion is fast.

Has this changed recently? Last time I used unordered_map, insertion was also slow because it had to allocate a linked list entry for every item in the hash table.

[+] emeraldd|8 years ago|reply
This is one of those data structures that everyone should try building at least once, kind of like linked lists, etc. Semi-unrelated, I built a couple of toy implementations a few years ago using the same basic ideas:

* This one is my original, written in something halfway resembling Scheme: https://github.com/arlaneenalra/Bootstrap-Scheme/blob/master...

* And this implementation is part of a half finished byte code scheme that I haven't touched in a few years. Another project I need to get back to. Interface: https://github.com/arlaneenalra/insomniac/blob/master/src/in... Internals: https://github.com/arlaneenalra/insomniac/tree/master/src/li...

They were kind of fun to build and I'd recommend giving it a try, especially if you have some skill but don't think you have the chops. Getting a working toy isn't really all that hard once you understand the principles.

[+] rdtsc|8 years ago|reply
It is really a basic data structure in most modern languages, either built in or part of the standard library.

I could see it if this were published in the late 90's, but today, I don't know.

And just use the built-in ones; don't invent your own unless there is a very good reason for it.

[+] dom0|8 years ago|reply
> And just use the built-in ones; don't invent your own unless there is a very good reason for it.

Uhm, hash tables are one of these things where plenty can be gained by tailoring the structure to the application. It's not a one-size-fits-all, and it's not nearly as easy as it would seem to make a good general-purpose HT.

[+] zzzcpan|8 years ago|reply
Built-in implementations are usually kind of slow, waste memory, or have broken iterators [1], even in modern languages. If any of those things is important, it's better to do some research and maybe even invent your own.

[1] when you can't both iterate and insert items consistently in the same loop

[+] taeric|8 years ago|reply
Hash Table has basically been elevated to a general idea lately. Pretty much anyone that is looking for an association between keys and values is using what they will call a hash table. To that end, few people actually know anything about how they are implemented. (And apologies if this comes across as negative, I do not mean it as a value judgement.)

This was different back when you would pick something like an alist to store associations. There the implementation stared you in the face. Same for the times you had to implement your own hash table. I don't exactly yearn for those days. Though, I am curious if alists actually win in speed for a large number of smaller hash tables.

[+] yxhuvud|8 years ago|reply
Depends on implementations of the alist and of the hash table. It is even fully possible that the small hash table will use an alist for the cases with few (and small) elements.
[+] hprotagonist|8 years ago|reply
"this is a data structure that is at the guts of basically every python object. Classes are dicts. Modules are dicts. Packages are probably dicts. Don't worry though, it's underrated!"

@_@

[+] dom0|8 years ago|reply
Python (and Ruby and some others) form the category of hash table interpreters—since most things are hash tables.
[+] draw_down|8 years ago|reply
I like both kinds of data structures- maps and lists.

(Just kidding, list is but a degenerate case of map)

[+] Cyph0n|8 years ago|reply
Quick question: are there special hash functions that are optimized for use in hash tables? Or do typical hash table implementations in e.g. Python just use standard hash functions like MD5?

Edit: It's 100% clear now. Thanks for the great answers everyone!

[+] robmccoll|8 years ago|reply
Typically the hash functions that you are familiar with from cryptographic or data-consistency use (SHA family, MD family, etc.) do not make for good hash table choices: they produce hashes that are much larger than needed, and they are slow to compute so that they can have better cryptographic properties (extremely low collision rates, no information leakage about inputs, difficulty of guessing inputs). When picking a hash function for a hash table, you want a function that makes a hash just big enough, with low enough collisions, while still being fast and easily dealing with variable length keys. This could be something as simple as byte-wise XOR or addition with some shifting as you iterate the key, followed by a mod or even a bitwise AND mask to pick an index.
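As one concrete example of that style of function, here is a sketch of FNV-1a (a well-known simple non-cryptographic hash) followed by an AND mask to pick a bucket; the constants are the standard 64-bit FNV parameters, and the table size is assumed to be a power of two so the mask works:

```python
def fnv1a(data: bytes) -> int:
    """64-bit FNV-1a: XOR in each byte, then multiply by the FNV prime."""
    h = 0xcbf29ce484222325                            # FNV-1a 64-bit offset basis
    for byte in data:
        h ^= byte                                     # mix in one byte of the key
        h = (h * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF  # FNV prime, wrap to 64 bits
    return h

TABLE_SIZE = 1024  # power of two, so a bitwise AND can replace the mod

def bucket(key: str) -> int:
    # Mask off the low bits of the hash to pick a slot index.
    return fnv1a(key.encode()) & (TABLE_SIZE - 1)
```

It's a few integer operations per byte of key, which is the trade-off being described: fast and "good enough" rather than cryptographically strong.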
[+] gizmo686|8 years ago|reply
In order to use a standard hash function, you would first need to serialize the object. I am not aware of any programming language that does this.

Instead, the approach taken by (at least) Java and Python is to define a "hash" function on objects, which classes can override. The standard way of implementing such a function is to combine the hashes of the object's fields.

Python advises doing this by wrapping the fields in a tuple and returning hash((self.a, self.b, ...)). [0]

Java takes a similar approach, but does not make an explicit recommendation on how to implement hashCode(). In my experience, most programmers just XOR the hashes of the fields, which (depending on the object) could be very sub-optimal, but is often good enough. Based on the docs [2], the typical implementation of Object.hashCode is to just take the memory address of the object.

[0] https://docs.python.org/3/reference/datamodel.html#object.__...

[1] Python's tuple hash function may be found here: https://hg.python.org/cpython/file/dcced3bd22fe/Objects/tupl...

[2] https://docs.oracle.com/javase/7/docs/api/java/lang/Object.h...
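A quick sketch of the tuple-wrapping approach described above (class and field names are invented for illustration):

```python
class Point:
    """Hashable value object: __hash__ delegates to the tuple hash of its fields."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        # __eq__ and __hash__ must agree: equal objects must hash equally.
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Combine the field hashes by wrapping them in a tuple, as the
        # Python data model docs recommend.
        return hash((self.x, self.y))

d = {Point(1, 2): "home"}
d[Point(1, 2)]  # -> 'home': a distinct but equal object finds the same slot
```

The tuple hash does the mixing, which avoids the pathologies of a plain XOR of fields (e.g. Point(1, 2) and Point(2, 1) XOR-hashing identically).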

[+] jwilk|8 years ago|reply
Contemporary versions of Python (≥ 3.4) use SipHash for hashing strings:

https://lwn.net/Articles/574761/

OTOH, the function for hashing integers is extremely simple:

  >>> hash(42)
  42
[+] hueving|8 years ago|reply
I recall reading at some point about Go having the option to use the CPU's AES instructions for fast high quality hashes, which is cool.
[+] booleandilemma|8 years ago|reply
"They didn't teach me about these in my coding bootcamp"
[+] sjroot|8 years ago|reply
"An underrated data structure"

Uh......

[+] relics443|8 years ago|reply
My data structures professor made a habit of proclaiming every few classes: "Give me a hash table, and I could rule the world"
[+] spankalee|8 years ago|reply
I wonder how hash tables are "underrated". They're probably _the_ data structure, if there is one.
[+] afinlayson|8 years ago|reply
I think 90% of technical interview questions have a hashtable somewhere in the answer. Underrated wouldn't be the word I'd use.
[+] Izmaki|8 years ago|reply
No they're not. And if they are, they are more advanced than this brief overview.
[+] ribs|8 years ago|reply
This word "extremely" has me skeptical from sentence one. Maybe I'm jaded.
[+] mhh__|8 years ago|reply
wtf i love hash tables now! ;)
[+] mabbo|8 years ago|reply
I think when the author says "underrated", what they mean is "I didn't realize how important this is". Hash tables are used everywhere, by everyone, for a lot of things.

Maybe they don't give it enough time in school for people to realize it is the king of practical software development.

[+] shishy|8 years ago|reply
The author of the article does not say underrated at all from what I can see; just something the OP included in his title. Your point still stands, though.
[+] rectangletangle|8 years ago|reply
Nearly every expression in high-level languages relies on multiple hash lookups. This is part of the reason these languages are regarded as "slow." I suppose you could use a tree in its place and get reasonable performance. However, the hash table's nature allows you to pretty much throw more memory at the problem in exchange for speed (though this is hardly unique to this particular data structure).

For instance, `a = foo.bar.baz` in Python involves 3 hash gets (local scope, foo's scope, then bar's scope) and a single set operation (local scope). This is part of the reason Python programs can be optimized by hoisting a deep attribute lookup into a local variable outside of a loop: it yields improved performance relative to doing the deep attribute lookup inside the loop.

  a = foo.bar.baz
  for _ in range(20):
      print(a)
vs

  for _ in range(20):
      print(foo.bar.baz)
[+] lsadam0|8 years ago|reply
Really. Hash tables are one of my most used data structures.
[+] Cyph0n|8 years ago|reply
Agreed. I mean, for example, doesn't the entire Lua language revolve around hash tables?
[+] dang|8 years ago|reply
OK, we changed the title to the first sentence of the article.
[+] fenomas|8 years ago|reply
Mods, can we get a title change? As I write this every comment is taking issue with the "underrated", which isn't claimed in the article.
[+] OJFord|8 years ago|reply
Though, in this case, changing it from the original does seem warranted. (This was just another bad choice...)
[+] pvg|8 years ago|reply
You can mail the mods and ask them, they usually fix these fairly quickly.
[+] deckarep|8 years ago|reply
It's actually sets that are underrated. All languages should come with a reference implementation.

Hashtables aka Maps aka Dictionaries aka Associative arrays are just fine.

[+] rdtsc|8 years ago|reply
Though once you have hashes/maps/dicts, sets are within reach: make sets from hashes where the keys are the elements and the values are just true or 1 or something like that.

But I think you probably meant having and using set operations effectively in day to day tasks, as in "make 2 sets and do a set difference operation" instead of "do a for loop on the first hash, check if each key is in the second, then put the results in an accumulator."

Another thing is to think about sets of sets. Can that be useful sometimes? Implementing it is slightly trickier, since you'd need to be able to get a hash of a set. Python has frozenset: https://docs.python.org/3/library/stdtypes.html#frozenset. I've used those on occasion.
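A small Python sketch of both ideas, a dict-as-set and a frozenset-based set of sets:

```python
# A "set" as a dict whose values are all just True: the keys are the elements.
a = {"x": True, "y": True, "z": True}
b = {"y": True, "z": True}

# Set difference the manual, for-loop way described above.
diff = {k: True for k in a if k not in b}
# diff has the single key "x"

# A set of sets needs hashable members, so freeze the inner sets first;
# a plain (mutable) set is unhashable and can't be a member.
family = {frozenset({1, 2}), frozenset({2, 3})}
frozenset({2, 1}) in family  # -> True: element order doesn't matter
```

With real set types you'd just write `set(a) - set(b)`; the dict version shows how little machinery sets actually need once you have a hash map.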

Then of course there is Erlang sofs (sets of sets) module. Stumbled on it by accident. Oh my, it comes complete with an introduction to set theory and relational algebra:

http://erlang.org/doc/man/sofs.html

It just struck me as so out of place with the rest of the standard library modules. I would like to know its history.

[+] ma2rten|8 years ago|reply
I would argue that Hashtables are a specific implementation of the abstract datatype "map" or "set".
[+] xyzzy4|8 years ago|reply
Sets are just like hashes where the value is always "true" for each key.