(no title)
Anon_troll | 5 months ago
Also, memory access is constant time only up to some limit set by the hardware; once the data no longer fits in the system's memory, the implementation needs significant changes. So the hash algorithm will not stay O(n) once you go past the available memory.
The sorting algorithms do not suffer from these complications quite as much, and similar approaches can be used with data sets that do not fit in a single system's memory. The sorting-based algorithms will likely win in the galactically large cases.
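To illustrate the out-of-core approach, here is a minimal sketch in Python (my own illustration, not the algorithm under discussion): sort chunks that fit in memory, spill each sorted run to a temporary file, then stream a k-way merge over the runs. The function name and chunk size are made up, and it assumes newline-terminated text records.

    import heapq
    import tempfile

    def external_sort(lines, chunk_size=100_000):
        """Sort an iterable of newline-terminated lines that may not fit in memory."""
        runs = []

        def spill(chunk):
            run = tempfile.TemporaryFile(mode="w+")
            run.writelines(sorted(chunk))   # each run is sorted in memory
            run.seek(0)
            runs.append(run)

        chunk = []
        for line in lines:
            chunk.append(line)
            if len(chunk) >= chunk_size:
                spill(chunk)
                chunk = []
        if chunk:
            spill(chunk)

        # heapq.merge streams the sorted runs lazily, so peak memory stays
        # bounded by the chunk size plus one buffered line per run.
        return heapq.merge(*runs)

Usage would be something like `for line in external_sort(open("huge.txt")): ...`.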
Edit: Also, once the hash table needs to grow beyond what the hash function can describe (e.g. beyond 64-bit integers), you need to grow the function's data type. This is essentially a hidden log(n) factor, since the required width of the data type is log(n) of the maximum data size.
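To make that hidden factor concrete, a quick back-of-the-envelope check (my own, with made-up sample sizes): distinguishing n keys needs at least ceil(log2(n)) bits of hash/index, so the "O(1)" hash step quietly widens as n grows.

    import math

    for n in (2**32, 2**64, 2**128):
        bits = math.ceil(math.log2(n))
        print(f"n = 2**{n.bit_length() - 1}: hash/index needs >= {bits} bits")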
shiandow | 5 months ago
Also, if items take up k bytes, then computing the hash is typically O(k), so both the hashing approach and radix sort are O(n k).
Really, radix sort should be considered O(N), where N is the total amount of data in bytes. It can beat the theoretical comparison-sort limit because it sorts lexicographically, which is not always an option.
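A minimal LSD (least-significant-byte-first) radix sort sketch, to show where the O(n k) = O(N) bound and the lexicographic order come from (my own illustration; the function name is made up):

    def radix_sort_bytes(keys, k):
        """Sort a list of length-k byte strings lexicographically in O(n * k)."""
        for pos in reversed(range(k)):           # last byte first
            buckets = [[] for _ in range(256)]   # one bucket per possible byte value
            for key in keys:
                buckets[key[pos]].append(key)    # stable, so earlier passes are preserved
            keys = [key for bucket in buckets for key in bucket]
        return keys

    print(radix_sort_bytes([b"cab", b"abc", b"bca", b"aab"], 3))
    # [b'aab', b'abc', b'bca', b'cab']

Each pass is a bucket/counting pass over all n items and there are k passes, hence n*k byte touches in total.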