Some guy called Ilan Schnell wrote a Python code generator which uses the very same algorithm, and comes with language templates for Python, C and C++ (and is easy to extend to others), and a dot file generator so you can visualize the graph:
Ilan Schnell is not "some guy". He's the original primary author of the Anaconda distribution. One of the main reasons that so many data scientists use Python.
I think I'm missing something conceptually as to why perfect hashing is even needed in this case. Since postgres has only about 450 or so keywords, shouldn't it suffice to just use a standard string hashing algorithm like:
unsigned int Hash(const std::string &x)
{
    unsigned int h = 31;
    for (size_t i = 0; i < x.length(); i++)
    {
        h = (h * 76991) ^ (x[i] * 77003);
    }
    return h;
}
I tested this and it seems to work fine with no collisions for the ~440 tokens postgres has. It's definitely not as theoretically fulfilling as a constructive way to generate perfect hashes, but unless they plan to add hundreds of new keywords in the future, shouldn't standard hash functions work (after tweaking the constants to ensure that no collisions occur for their particular set of words)?
Binary search is a pretty bad algorithm. Sure, it runs in O(log n) time and it's more or less the best you can do to solve the general problem of finding something in a sorted list, but that does not mean it's a good way to create an immutable list of things that can be searched using comparisons.
In particular, binary search has awful cache locality. Until you get very close to your target, you are essentially hitting a random cache line (and maybe even a random page for large lists) for every comparison. This is bad. A B-tree-like structure that is never modified is very easy to make, and it will be much faster.
This came up with the ORC unwinder in Linux. We got something like a 4x speedup by changing from binary search to something a bit more clever to find unwind entries in the table.
> In particular, binary search has awful cache locality.
That makes me think of an idea which I'm certain can't be original, but I can't think of the jargon for what it's called.
Suppose you have a fixed-length sorted array, and its binary search can be visualized as a tree. Traverse that tree in level-order and pop it into a helper-array, arranged much like a heap: the midpoint is first, and all left and right children follow at offsets that you can predict.
So the original array [A,B,C,D,E,F,G] has a data-structure for searching of [(D,3), (B,1), (F,5), (A,0), (C,2), (E,4), (G,6)]
While the high-indexed items will get spread out, the most common and sequential activity occurs all together near the lower indexes, minimizing large jumps. Since each entry has the original index stored, you can always switch back to the initial array when a different strategy (e.g. linear scan) becomes preferable.
Actually cache locality might not be quite as bad as you think: the first few levels of the "pivot tree" will definitely be cached (although wasting a whole cache line on a single element is certainly not optimal use of the cache). To get better locality one could use a breadth-first layout of the implicit binary search tree, like a binary heap. One compromise could be to duplicate the first few levels of pivots at the beginning of the array, in breadth-first order, while storing the array in the usual sorted order. See "Array Layouts for Comparison-Based Search" for more details.
I came across this last night (http://dtrace.org/blogs/bmc/2018/09/28/the-relative-performa...) and went down the rabbit hole of Rust and its BTreeSet (http://cglab.ca/~abeinges/blah/rust-btree-case/). They came to a very similar conclusion. It all makes me wonder if there's low hanging fruit in a lot of languages/libraries to simply swap out red black trees for B-trees (though I'm guessing the higher constant time cost of B-trees would make it necessary to swap between tree implementations for a truly general case).
That's exactly what phk wrote about: theoretically perfect algorithms such as binary search are never practically perfect, because they never account for real-life things like disk latency and memory pressure. This has led developers to come up with inefficient software.
Very true. Here's a paper from last year that measures the performance of searches on several implicit search tree layouts on CPU and GPU: https://algoparc.ics.hawaii.edu/pubs/18-ipdps.pdf. You can see the results in figures 7 and 9 (pages 9 and 10).
tl;dr: in an array of 500 million elements, if you're doing more than ~5 million queries (1%), it pays off to spend the time to construct a more efficient representation of the data and search on that. On a GPU, you need ~15 million queries (3%).
SQLite already implements such a mechanism [0] for determining whether or not a given identifier is really an SQL keyword. The result is a single static hash table (a C array of chars) with O(1) lookup. This is the fastest known implementation for keyword lookup.
I'm not clear as to why that'd be faster than what we/PG have done. On a quick skim, that allows for conflicts in the hashtable, which our new implementation doesn't? Ours is also a C array (albeit of offsets into one blob of strings rather than the strings directly; the higher data density achievable due to that was worth it).
The way I've seen most languages do it is something like this:
switch (c[0]) {
case 'c':
    if (strcmp(c, "case") == 0) return CASE;
    if (strcmp(c, "continue") == 0) return CONTINUE;
    ...
    break;
case 'f':
    if (strcmp(c, "for") == 0) return FOR;
    if (strcmp(c, "friend") == 0) return FRIEND;
    ...
    break;
}
I suspect that's good enough if you have a typical keyword list of say 20-100.
I just looked at Clang and it appears to build the list at runtime, using a regular hash table. I think this is because it has to handle C, C++, and Objective C, and they have different sets of keywords.
It seems like they could change it to use something like perfect hashing, and if that's even a 1% win, it would be worth it for the extra build complexity, since C and C++ compilation take up so many CPU cycles in the world.
See lib/Basic/IdentifierTable.cpp.
Although another problem is that the list of keywords depends on whether you're compiling C89, C99, C++98, C++11, etc. But that is a finite list, so it seems like it can still be done with code generation ahead of time like Postgres does.
In Oil [1] I use re2c [2] to recognize keywords, along with the rest of the complex lexing process, and it works well. It builds a DFA expressed in C code. The slower parts are elsewhere :)
EDIT: In theory, the DFA seems better than perfect hashing because you can bail on the second character if you get "xzoo" and no keyword starts with "xz". I think with perfect hashing you still have to look at every byte.
The switch statement above also avoids looking at every byte in certain cases.
You don't necessarily have to look at every byte to compute the hash - gperf will generate hash functions that use only a few characters at seemingly random offsets. You do however have to do a full strcmp() to rule out a hash collision (with words outside of your set), so the DFA wins there.
Imho though, the most significant difference between a DFA and a PHF is that the DFA would likely generate code that introduces a lot of branches and control flow, which your CPU's branch predictor might have a harder time with than a series of arithmetic ops and a single table lookup.
By the way, thanks for oilshell. I don't use it, but your blog is a superb resource on parsing techniques.
There's a lot of overlap with those keywords, so perhaps the best bet is to create a single superset and use an attribute returned by the lookup to see if it's needed by the target language.
We recently started using perfect hashing in V8's parser, also with measurable performance improvement (even over what we had previously, a first character jump table followed by linear search). It's always nice to apply that computer science education to a real world problem.
They replaced one textbook algorithm with another. It turns out to be faster, but it might well be slower for certain inputs (which might be common). Also, when they increase the number of keywords in the future, they might have to reconsider the implementation.
Reading the discussion thread is fantastic. The postgres community culture definitely feels a lot different than some of the other open source projects I've seen.
Imagine a language with five keywords, for simplicity. Now you have a token that you need to interpret:

let token_raw = "FROM";
let index = keywords.binary_search(token_raw);
let token = tokens[index];
As a binary search, that's O(log N) string comparisons. In the worst case it may try comparing with INSERT, then AS, then FROM before deciding the index is 1, but with more keywords you have more comparisons.
With perfect hashing, you devise a single hash function that hashes all possible values uniquely. Now we have something like this:
… with all the numbers calculated by hashing the strings at compile time. Then, instead of O(log N) string comparisons, you only need one hash and one hash table lookup:
let token_raw = "FROM";
let token = tokens[hash(token_raw)];
A perfect hash is one in which there are no collisions for a given data set, which guarantees O(1), or constant, lookup time. This patch introduces a Perl script which computes a perfect hash function at build time given the current set of SQL keywords.
The old keyword lookup method uses binary search. Binary search is a search algorithm that must compare at most O(log n) entries before finding a match.
Making 1 keyword comparison instead of log n keyword comparisons yields the 20% speedup.
It seems that this is a special case of hashing, indicated by the word 'perfect', where it somehow (I did not read the article and I don't know how this is done) generates a guaranteed collision-free hash table. While you can do the same with cryptographic hash functions, I'm sure that those are much slower than what is implemented as non-cryptographically secure hash function in databases.
That reminds me: some years ago, I created a pull request for redis to use perfect hashing with gperf to parse commands. IIRC, the author turned it down because he was wary of adding the build-time dependency on gperf.
That's awkward, because the gperf dependency is only needed every 2 years or so, when a new keyword is added. The generated C code needs to be added to git, and only needs a regen if changed. I do that all the time via gperf or via some perl-generated perfect hash tables, e.g. for static keywords, config tables, unicode tables, ...
The memory win is always dramatic, like 20x better typically.
I also use some standard gperf-fixup scripts for easier inclusion into my code.
Yes, you can, but this approach is better. Pearson would beat CHM on very small CPUs (8-16 bit), and it is not guaranteed to find a good permutation; with CHM it's much easier.
The most important thing is to store the keywords for the false-positive verification in a contiguous const char *const array.
Since there is a lot of discussion about alternative solutions to the keyword lookup problem, I was wondering why tries haven't been mentioned yet.
Since the current implementation has to do a string compare after the hash, wouldn't a Trie be at least as fast? Am I missing something? Are there other disadvantages?
For the trie, you still would have to do multiple character comparisons (on level 1, on average about 13, I guess, given that I read there are 400 keywords, and 26 characters in the alphabet).
Presumably, hashing the string is faster. I think that is partly because it introduces fewer cache misses than walking a trie. Both this and a trie have to walk the input string, but code using a trie jumps around the trie, while this code only compares to the keyword, stored in contiguous bytes.
Another reason might be that trie-based code is worse for branch prediction logic.
Someone else can probably comment better, but hashing seems more memory-efficient than holding the trie in memory, and hashing is probably also better cache-wise compared to jumping between the nodes of the trie.
http://ilan.schnell-web.net/prog/perfect-hash/
He also has a superb illustrated explanation of the algorithm, a unique property of which is that you get to choose the hash values:
http://ilan.schnell-web.net/prog/perfect-hash/algo.html
I've been using it for years.
Binary search with heap order is cache friendly and much faster. And it can be sped up with SIMD, like Intel's FAST algorithm.
[0]: https://sqlite.org/src/artifact/1f7f2ac1d9f262c0
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f...
[1] http://www.oilshell.org/
[2] http://re2c.org/
"The time savings is indeed significant: preliminary testing suggests that the total time for raw [SQL] parsing (flex + bison phases) drops by ~20%."
Very impressive work!
Note that the PostgreSQL code uses a 2-graph instead of a 3-graph version. The latter is noticeably more compact: a table-size factor of 1.25 vs 2.1.
Very interesting.
[1] https://www.youtube.com/watch?v=87pUZYERkNQ (youtube)