(no title)
markisus | 5 months ago
I expect that in this case, like in all cases, as the datasets become galactically large, the O(n) algorithm will start winning again.
Anon_troll|5 months ago
Also, memory access is constant time only to some upper limit allowed by the hardware, which requires significant changes to the implementation when the data does not fit the system memory. So, the hash algorithm will not stay O(n) once you go past the available memory.
The sorting algorithms do not suffer from these complexities quite as much, and similar approaches can be used with data sets that do not fit a single system's memory. The sorting-based algorithms will likely win in the galactically large cases.
Edit: Also, once the hash table would need to grow beyond what the hash function can describe (e.g. beyond 64 bit integers), you need to grow the function's data type. This is essentially a hidden log(n) factor, as the required length of the data type is log(n) of the maximum data size.
shiandow|5 months ago
Also, if items take up k bytes, then computing the hash must typically take O(k), so both hashing and radix sort are O(n k).
Really radix sort should be considered O(N) where N is the total amount of data in bytes. It can beat the theoretical limit because it sorts lexicographically, which is not always an option.
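A minimal LSD (byte-wise, lexicographic) radix sort sketch in Python, illustrating the O(n·k) cost for fixed-length k-byte keys; the function name and interface are mine, just for illustration:

```python
def radix_sort_bytes(keys, k):
    """LSD radix sort of fixed-length k-byte keys (base 256).

    Each of the k passes touches all n keys once, so the total work is
    O(n * k), i.e. O(N) in the total number of bytes N = n * k.
    """
    # Walk from the least significant (last) byte to the most significant.
    for pos in range(k - 1, -1, -1):
        buckets = [[] for _ in range(256)]
        for key in keys:
            buckets[key[pos]].append(key)  # stable: preserves prior passes
        keys = [key for bucket in buckets for key in bucket]
    return keys

data = [b"cab", b"abc", b"bca", b"aab"]
print(radix_sort_bytes(data, 3))  # [b'aab', b'abc', b'bca', b'cab']
```

The stability of each pass is what makes LSD radix sort correct: earlier (less significant) orderings survive ties in later passes.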
zahlman|5 months ago
Basically, the hashed sorting approach is effectively O(n), and it is so for the same reason that the "length of hash table" approach is. The degenerate case is counting sort, which is a single pass with a radix base of k. (Which is analogous to pointing out that you don't get hash collisions when the hash table is as big as the hashed key space.)
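That degenerate case can be sketched in a few lines of Python, assuming small non-negative integer keys (a minimal illustration; the names are mine):

```python
def counting_sort(items, key_space):
    """Single-pass 'radix' sort where the base equals the key space.

    Analogous to a hash table big enough to avoid all collisions:
    one bucket per possible key, so one pass suffices -> O(n + key_space).
    """
    buckets = [0] * key_space
    for x in items:
        buckets[x] += 1
    out = []
    for value, count in enumerate(buckets):
        out.extend([value] * count)
    return out

print(counting_sort([3, 1, 4, 1, 5], key_space=6))  # [1, 1, 3, 4, 5]
```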
daemontus|5 months ago
One more fun fact: this is also the reason why Turing machines are a popular complexity model. The tape on a Turing machine does not allow random access, so it simulates the act of "going somewhere to get your data". And as you might expect, hash table operations are not O(1) on a Turing machine.
spott|5 months ago
Sorting (really scanning through an array) is very efficient in cache lines, while hash tables are very inefficient.
In the example, every memory access in hashing gets 8 bytes of useful data, while every memory access in sorting gets 64 bytes of useful data.
So you have to be 8x more efficient to win out.
For this problem, the radix sort is log_1024(n) passes, which for a key size of 64 bits can’t ever get above 7 passes.
If you increase the key size, the efficiency win of sorting drops (with 128-bit keys, sorting has to finish in fewer than 4 passes to win).
Essentially the size of the key compared to the cache line size is a (hidden) part of the big O.
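The pass counts quoted above can be checked with a quick sketch (the 1024-bucket assumption is taken from the comment; the function name is mine):

```python
import math

def radix_passes(n, buckets=1024):
    """Passes of a 1024-bucket radix sort needed to distinguish n keys."""
    return math.ceil(math.log2(n) / math.log2(buckets))

# 64-bit keys: even the full key space of n = 2^64 needs only ceil(64/10) passes
print(radix_passes(2**64))   # 7
# 128-bit keys: the full key space would need ceil(128/10) passes
print(radix_passes(2**128))  # 13
```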
aDyslecticCrow|5 months ago
I could imagine the hash table wins again beyond an even greater threshold. Like what about 120GB and beyond?
ants_everywhere|5 months ago
So it comes down to the cost of hashing, hash misses, and other factors as discussed in the article.
I remember learning that radix sort is (almost?) always fastest if you're sorting integers.
A related fun question is to analyze hash table performance for strings where you have no bound on string length.
shoo|5 months ago
Here's my attempt at it, based on that analysis:
Suppose each item key requires s bytes
For the hash table, assuming s <= 64, our cache line size, then we need to read one cache line and write one cache line.
The bandwidth to sort one key is p(N) * 2 * s, where p(N) is the number of passes of 1024-bucket radix sort required to sort N elements, and the 2 comes from 1 read + 1 write per radix sort pass.
p(N) = ceil(log2(N) / log2(buckets))
Suppose N is the max number of items we can distinguish with an s=8 byte item key, so N = 2^64
then p(N) = ceil(64 / 10) = 7 passes
7 radix passes * (1 read + 1 write) * 8 bytes per item key = 112 bytes of bandwidth consumed; this is still less than the hash table's 2 * 64 = 128 bytes.
We haven't found the threshold yet.
We need to either increase the item count N or increase the item key size s beyond 8 bytes to find a threshold where this analysis estimates that the hash map uses less bandwidth. But we can't increase the item count N without first increasing the key size s. Suppose we just increase the key size and leave N as-is.
Assuming N=2^64 items and a key size of s=10 bytes gives an estimate of 2 * 10 * 7 = 140 bytes of bandwidth for sorting vs 128 bytes for the hash table. We expect sorting to be slower for these values, and increasing s or N further should make it even worse.
(The bandwidth required for the hash map hasn't increased, as our 10-byte key is still smaller than a cache line.)
Edit: the above is wrong, as I'm getting mixed up with elements vs. items. Elements are not required to be unique items; if every element were a unique item, the problem would be trivial. So N is not bounded by the key size, and we can increase N beyond 2^64 without increasing the key size s beyond 8 bytes.
Keeping the key size s fixed at 8 bytes per item suggests the threshold is at N = 2^80 items, given by solving for N at the point where the bandwidth estimate for sorting equals the bandwidth estimate for the hash table:
2 * 8 * (log2(N) / log2(1024)) = 2 * 64
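Under the same assumptions (8-byte keys, 64-byte cache lines, 1 read + 1 write per radix pass, 1024 buckets), the threshold can be checked numerically; a sketch, with names of my own choosing:

```python
import math

KEY_BYTES = 8    # s, the key size assumed above
CACHE_LINE = 64  # bytes moved per hash-table access

def sort_bandwidth(n):
    """Estimated bytes of bandwidth per item for 1024-bucket radix sort."""
    passes = math.ceil(math.log2(n) / math.log2(1024))
    return 2 * KEY_BYTES * passes  # 1 read + 1 write of the key per pass

def hash_bandwidth():
    """Estimated bytes per item for the hash table: 1 line read + 1 written."""
    return 2 * CACHE_LINE

# At N = 2^80 the estimates meet: 8 passes * 2 * 8 bytes = 128 = 2 * 64
print(sort_bandwidth(2**80), hash_bandwidth())  # 128 128
# One more factor-of-1024 in N adds another pass, tipping the estimate
print(sort_bandwidth(2**81))  # 144
```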
cdavid|5 months ago
[edit] I should have said "basic big O complexity" makes assumptions that break down. You can of course decide to model memory access as part of big O, which is just a choice of mathematical model.
simiones|5 months ago
However, in this particular case, the higher-complexity sorting algorithm gets better than the hash algorithm as N gets larger, even up to pretty large values of N. This is the counter-intuitive part. No one is surprised if an O(n log n) algorithm beats an O(n) algorithm for n = 10. But if the O(n log n) algorithm wins for n = 10 000 000, that's quite surprising.