
Sorting 1 million 8-digit decimal numbers in 1MB of RAM

144 points | joslin01 | 13 years ago | stackoverflow.com

81 comments

[+] ChuckMcM|13 years ago|reply
This was a Google interview question for a while (I know, I got it as one); I think it has since made it to the banned-question list. The 'trick' was, as answer #2 alluded to, to use bits to store the fact that you had observed the number, and then you could dump them back out in sorted order.

So imagine you have a bit field where 'one bits' indicate you have seen the number and 'zero bits' indicate you haven't. And you compress it with run-length encoding.

Your initial data structure is '99999999:0' (all zeros, haven't seen any numbers), and then let's say you see the number 3,866,344, so your data structure becomes '3866344:0,1:1,96133654:0'. As you can see, the numbers will always alternate between counts of zero bits and counts of one bits, so you can just assume the odd-position numbers represent 0 bits and the even-position numbers 1 bits. This becomes (3866344,1,96133654)

Anyway, as you get numbers you will split and coalesce the bit space until you've either seen all numbers (0, 99999999) or you run out of memory because a hacker has sent you only the even numbers from the space.

It's a clever 'math trick' which explores how you think about numbers (are they a thing or a representation?). But I never thought it really gave a good indication of whether or not you would be a good candidate for the company.

[+] ChuckMcM|13 years ago|reply
So as it turns out, this is wrong. Several folks have mentioned that the original Google interview question is 'sort the numbers' / 'remove duplicates', and the solution is to create a bitmap in a disk file to do a single-pass sort. This question mixes that up a bit, and my particular bitmap solution fails to meet the criterion of staying under a MB of memory.

With all the discussion I decided to code up a quick version in Perl. The code is pretty straightforward: start with the tuple (99999999,0) and on each number do one of three actions:

1) Break the number representing a string of zeros into two segments.

2) Extend the length of an existing region of one bits.

3) Collapse two regions of one bits separated by a single zero bit.
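A rough Python sketch of those three actions (illustrative, not ChuckMcM's original Perl; a toy space of 0..99 stands in for 0..99999999, and duplicates are simply ignored rather than counted):

```python
def insert(runs, n):
    """Set bit n in an alternating run-length list.
    runs[0], runs[2], ... count zero bits; runs[1], runs[3], ... count one bits."""
    pos = 0
    for i in range(len(runs)):
        if pos <= n < pos + runs[i]:
            if i % 2 == 1:
                return runs                  # bit already set: a duplicate
            left = n - pos                   # 1) split the zero run in two
            runs[i:i + 1] = [left, 1, runs[i] - left - 1]
            break
        pos += runs[i]
    i = 1
    while i < len(runs) - 1:                 # 2) and 3): a zero-length run means
        if runs[i] == 0:                     # its neighbours touch, so merge them,
            runs[i - 1:i + 2] = [runs[i - 1] + runs[i + 1]]
            i = max(i - 1, 1)                # extending or collapsing one-runs
        else:
            i += 1
    return runs

runs = [100]                 # all zeros: no numbers seen yet
for n in (5, 6, 4, 4):       # the repeated 4 is detected as a duplicate
    runs = insert(runs, n)
print(runs)                  # [4, 3, 93]: 4 zeros, ones for 4,5,6, then 93 zeros
```

Every isolated bit adds two new counters to the list, which is where the 8-bytes-per-number blowup comes from.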

Now the two key reasons it doesn't work are: one, it takes two 32-bit numbers to identify non-adjacent numbers, and two, for a uniform-distribution random number generator, the ratio of a million numbers to a space of 100 million means that most of the numbers won't be adjacent. So each 1 bit will cost 8 bytes to store. A million numbers * 8 bytes is 8 million bytes (7.6MB if you use 2^20 bytes as 1MB). It's fun watching the algorithm, because it gets slower and slower (it's doing a linear search of segments to insert 'bits') and memory goes up and up, until you have about 50M uniques, and then it starts getting faster and faster again as it collapses more and more segments.

Storing it on disk with an uncompressed bitmap runs in O(n) time, and of course uses 12,500,000 bytes, just about 12MB (to count multiples you need to multiply that by the number of bits you want to reserve per number), but doing it in memory only requires a better compression algorithm than simple run-length encoding.

[+] jules|13 years ago|reply
It is extremely unlikely that you'll be able to sort a random sequence of 1 million 10-digit numbers like that, or with ANY method for that matter, unless the particular sequence is highly compressible. You won't even be able to store the answer in the average case. You'll need 10^6*lg(10^10) bits to store the raw input, from which you'll be able to save about lg(10^6!) bits because it's sorted. That comes out to more than 1.7 megabytes.

Edit: the exact amount of bits necessary is ceil(log_2((1e10+1e6-1) choose 1e6)), which is ~1.756 megabytes.

Edit edit: see udiv's comment below: that 1e10 should be 1e8 of course.
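With the corrected 1e8, the bound can be checked numerically without computing the enormous binomial itself, using log-gamma (a sketch; lgamma(n+1) = ln n!):

```python
from math import lgamma, log

def log2_comb(n, k):
    # log2 of (n choose k) via log-gamma, since the binomial is astronomically large
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(2)

k = 10**6                              # count of numbers to store
bits = log2_comb(10**8 + k - 1, k)     # sorted multisets of k values from 10^8
print(bits / 8 / 2**20)                # about 0.96 binary megabytes: it just fits
```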

[+] victorhn|13 years ago|reply
How does that handle duplicate elements?
[+] alecco|13 years ago|reply
Thanks for the context.

So that solution does not cover all possible cases. Lovely of Google to ask that question.

Edit: not only that, the running time of Google's "solution" would be terrible. Aren't there other possible data sets not covered, or am I missing something?

[+] 055static|13 years ago|reply
Maybe they're just looking for improvements in the way they do sorting? Asking questions like these in interviews might be a way to find people who could teach their code monkeys new tricks?

Or it might have just been some silly way to eliminate candidates for arbitrary reasons. The usual.

Anyway, sorting is a big deal I think. If you can do it faster even by just a little bit than everyone else, that's a competitive advantage. Just my personal opinion.

[+] Evbn|13 years ago|reply
I thought the popular Google version allowed you to use secondary storage, and the challenge was to be efficient in how often you swapped.
[+] Daniel_Newby|13 years ago|reply
"The list of numbers may contain duplicates, which I must not discard."
[+] codex|13 years ago|reply
All of these bizarro explanations leave me wondering if I've missed something. Can't this be elegantly solved via the pigeonhole principle?

There are one million numbers to be sorted, out of a space of one hundred million. That means that no number may be more than 100 away from any other number once fully sorted (7 bits). Therefore, you can simply use deltas to encode your numbers, as 7 bit deltas * one million numbers < 1 MB RAM.

EDIT: should've been clearer: no number may be more than 100 away from any other number on average once fully sorted. Therefore, it's an average of 7 bits per number, maximum. Duplicates are even easier, since it's only one bit to encode the duplicate (a delta of zero).

EDIT 2: As for the encoding to be used, I think a universal code or Golomb code would probably be sufficient. They can get quite close to natural entropy.
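As a rough check on the Golomb suggestion: a Rice code (the power-of-two special case of Golomb coding) with an assumed parameter k=6 averages just over 8 bits per delta on uniform random input, though as the replies point out, adversarial inputs break the average-case assumption (a sketch):

```python
import random

def rice_bits(delta, k=6):
    """Bit cost of one delta: unary quotient, stop bit, then k-bit remainder."""
    return (delta >> k) + 1 + k

random.seed(1)                       # fixed seed, purely illustrative
values = sorted(random.randrange(10**8) for _ in range(10**6))
prev, total = 0, 0
for v in values:                     # cost of delta-coding the sorted stream
    total += rice_bits(v - prev)
    prev = v
print(total / len(values))           # roughly 8.1 bits per number on this input
```

On this uniform input the total squeezes just under the 8,388,608-bit budget; the worst cases in the replies do not.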

[+] tolmasky|13 years ago|reply
Duplicates are allowed, so you could for example have 500,000 zeros and 500,000 99,999,999's, which are more than 100 apart.
[+] hcles|13 years ago|reply
> That means that no number may be more than 100 away from any other number once fully sorted (7 bits)

It's possible that 90% of numbers are <1 million, and the remaining 10% are > 91 million. That's a 90 million delta possibility.
[+] wangg|13 years ago|reply
Hmm, that doesn't follow, I can have 0, and from 99,000,002 - 100M.
[+] damienkatz|13 years ago|reply
Since no computational bound has been placed, this problem could be solved in O(n^2) by an insertion sort, keeping the list of numbers sorted in memory as they are received. The problem then boils down to encoding a list of sorted 8-digit decimal #s, where it's possible to insert new #s.

Since the #s are stored sorted and bounded in size, they can be encoded as deltas, which will be more space efficient than storing absolute values. Now we just need to figure out the worst-case encoding and whether 1 million values will fit.

[+] andrewcooke|13 years ago|reply
i think that's the clearest answer i've seen. really, it's pretty much equivalent to the other answers, but at least for me it's by far the easiest to grasp. although i guess that may be because the idea is simply becoming familiar after reading the others.

the best "high level" explanation, i think, is that you are compressing the sorted numbers, which are therefore not random, and so concerns about the incompressibility of random streams are completely irrelevant.

[+] eru|13 years ago|reply
Merge sort would work better. But yes, the secret sauce is in the compression, not the sorting.
[+] uvdiv|13 years ago|reply
Information theory to the rescue! Well, not really.

8 decimal digits takes 26.6 bits. An ordered list of 10^6 of these takes 3.17 MiB. The information contained in the ordering is lg(10^6!) ~= 2.20 MiB [0]. So as an unordered multiset, the information content is 0.96 MiB. It's at least theoretically possible to store the input in 1 MiB of working memory. But only just; in fact it's significant that the problem specifies 2^20 bytes, because for an SI megabyte (10^6 bytes), it wouldn't work.

I don't think it's actually possible though. The answers here don't do it. LZMA accomplishes nothing.

[0] Stirling's approximation in base 2: lg(n!) ~= n lg(n) - n/ln(2)
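uvdiv's three figures can be reproduced numerically (a sketch using `math.lgamma`, i.e. exact ln n!, rather than the Stirling approximation in [0]):

```python
from math import lgamma, log, log2

MiB = 8 * 2**20                        # bits per mebibyte
raw = 10**6 * log2(10**8)              # ordered list: 26.6 bits per number
order = lgamma(10**6 + 1) / log(2)     # lg(10^6!): information in the ordering
multiset = raw - order                 # what an unordered multiset really needs
print(raw / MiB, order / MiB, multiset / MiB)   # approx 3.17, 2.20, 0.96
```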

[+] dpark|13 years ago|reply
> The information contained in the ordering is lg(10^6!)

Can you point me to a resource that discusses the information theory behind this claim? I'm interested in learning more, but don't know what to search for.

[+] ralfd|13 years ago|reply
> So as an unordered multiset, the information content is 0.96 MiB

Could you please explain this more or point to other links?

[+] eru|13 years ago|reply
The second answer does it.
[+] mikeash|13 years ago|reply
Since code size is not restricted, the obvious solution is to store the entire list of numbers received in the program counter. In short, put every possible input combination into the code, and just follow whatever branch pattern you receive.

This requires no memory other than that for the networking stack. It is, of course, also completely impractical.

[+] curiousdannii|13 years ago|reply
This was my thought as well. But if you can afford several TB of ROM you can probably afford to increase your RAM to 4MB.
[+] tolmasky|13 years ago|reply
EDIT: I put this same description into the stackoverflow page with an image which might help visualize it.

I think one way to think about this is from a combinatorics viewpoint: how many possible combinations of sorted number orderings are there? If we give the combination 0,0,0,....,0 the code 0, and 0,0,0,...,1 the code 1, and 99999999, 99999999, ... 99999999 the code N, what is N? In other words, how big is the result space?

Well, one way to think about this is noticing that this is a bijection of the problem of finding the number of monotonic paths in an N x M grid, where N = 1,000,000 and M = 100,000,000. In other words, if you have a grid that is 1,000,000 wide and 100,000,000 tall, how many shortest paths from the bottom left to the top right are there? Shortest paths of course require you only ever either move right or up (if you were to move down or left you would be undoing previously accomplished progress). To see how this is a bijection of our number sorting problem, observe the following:

You can imagine any horizontal leg in our path as a number in our ordering, where the Y location of the leg represents the value (image: http://i.stack.imgur.com/aJp4b.png ). So if the path simply moves to the right all the way to the end, then jumps all the way to the top, that is equivalent to the ordering 0,0,0,...,0. If it instead begins by jumping all the way to the top and then moves to the right 1,000,000 times, that is equivalent to 99999999,99999999,...,99999999. A path where it moves right once, then up once, then right once, then up once, etc. to the very end (then necessarily jumps all the way to the top) is equivalent to 0,1,2,3,...,999999.

Luckily for us this problem has already been solved; such a grid has (N + M) Choose (M) paths:

(1,000,000 + 100,000,000) Choose (100,000,000) ~= 2.27 * 10^2436455

N thus equals 2.27 * 10^2436455, and so the code 0 represents 0,0,0,...,0 and the code 2.27 * 10^2436455 and some change represents 99999999,99999999,..., 99999999.

In order to store all the numbers from 0 to 2.27 * 10^2436455 you need lg2 (2.27 * 10^2436455) = 8.0937 * 10^6 bits.

1 megabyte = 8388608 bits > 8093700 bits

So it appears that we at least actually have enough room to store the result! Now of course the interesting bit is doing the sorting as the numbers stream in. Not sure what the best approach to this is, given we have 294908 bits remaining. I imagine an interesting technique would be to at each point assume that what you have so far is the entire ordering, find the code for that ordering, and then, as you receive a new number, go back and update the previous code. Hand wave hand wave.
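The headline arithmetic checks out (a sketch using log-gamma rather than the exact, two-million-digit binomial):

```python
from math import lgamma, log

def log2_comb(n, k):
    # log2 of (n choose k), computed via log-gamma to avoid huge integers
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(2)

bits = log2_comb(100_000_000 + 1_000_000, 1_000_000)   # lg of the path count
print(round(bits))               # about 8.09 million bits
print(8 * 2**20 - round(bits))   # a couple hundred thousand bits to spare
```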

[+] kamaal|13 years ago|reply
I think the answers to most of these seemingly impossible questions depend on the discovery of isomorphic problem spaces where they can be solved a lot more easily.

In your case you translated the problem to a different problem space and solved it there. The other contributors on Stack Overflow tend to do the same, e.g. using network latency, compression, etc. to solve these problems.

These sort of solutions become very interesting when they become isomorphic to some other real world problems.

[+] saucerful|13 years ago|reply
In case anyone is interested in how to prove "(N + M) Choose (M) paths":

Any such path ends at (N,M). Such a path can be represented as a sequence of N+M bits, 0=right, 1=up, where there are exactly M ups. So choosing a path is identical to choosing the positions of the M ups, thus there are (N+M) choose (M) such paths.

EDIT: My first proof was needlessly complicated because it dealt with paths that ended at arbitrary (N,m) but really you can just let the paths go all the way up to (N,M)-- even if the max value in the list is m < M-- and just ignore the last part of the path. It's still a bijection.
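The bijection can be sanity-checked by brute force on a tiny grid (a sketch: nondecreasing length-N sequences over values 0..M stand in for the sorted lists, mirroring the grid above, which allows M+1 distinct values):

```python
from itertools import combinations_with_replacement
from math import comb

N, M = 4, 5   # toy stand-ins for 1,000,000 numbers and max value 99,999,999
# every nondecreasing length-N sequence over 0..M is one monotonic path
paths = list(combinations_with_replacement(range(M + 1), N))
print(len(paths), comb(N + M, M))   # 126 126
```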

[+] greendestiny|13 years ago|reply
Yeah, I think if you number the permutations in order from smallest to largest (i.e. all 0 to all 99999999) then you can update as new numbers come in by keeping track of the number (N) you've already seen and assuming the other (million - N) are 0. So you just have your 10^6 bit number representing the sorted list of numbers and a 27-bit integer representing the number of samples seen. The update function might be a bit ugly though :)

It's probably actually a question of whether the update function could be done in the remaining memory - it certainly couldn't unpack the whole representation.

[+] spitfire|13 years ago|reply
Radix sort. Done.

Depending on his dataset characteristics a radix sort can have a space requirement as low as a few hundred bytes to sort several million values.

E.g., for 8-bit values: simply make a 256-byte array. Increment the appropriate count each time you see a value. When you've gone through the list, loop through the array, outputting each index count times. It's also quite cache friendly, though the last time I compared was on a Pentium Pro, against quicksort.
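The 8-bit example can be sketched in a few lines (a toy, with Python ints as counters rather than single bytes):

```python
def counting_sort_bytes(values):
    """Sort 8-bit values using a 256-entry count array."""
    counts = [0] * 256
    for v in values:                   # one pass to tally each value
        counts[v] += 1
    out = []
    for v, c in enumerate(counts):     # emit each value count times, in order
        out.extend([v] * c)
    return out

print(counting_sort_bytes([200, 3, 3, 255, 0, 17]))   # [0, 3, 3, 17, 200, 255]
```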

For larger datasets, you actually want to compare on the digits (LSD or MSD first), and that'll take more memory.

EDIT: Originally posted that it'd take 256 bytes of memory. That's not true for his dataset.

[+] danbruc|13 years ago|reply
There is not enough memory to keep all the numbers, and therefore radix sort is not a viable solution. I think you actually meant counting sort, but that will not work either: it requires 100,000,000 slots, each 20 bits wide, for a total of almost 238.5 MiB.
[+] mfukar|13 years ago|reply
> Radix sort. Done.

Got code?

[+] optimiz3|13 years ago|reply
As others have mentioned, you need 7 bits per number (on average) if you store the numbers as deltas in sorted form. So 7M bits out of 8,388,608 bits leaves about 1.32M bits of working set.

One could implement a simple block allocator, where each block contains a sequential list of deltas.

The trick to fast insertion is to place new blocks at addresses interpolated between 0 and 10^8. If there is a collision, merge blocks. If the input distribution is off, physically reallocate colliding blocks left or right into free space.

So inserting the numbers 10, 20, 1000, 2000, 1M, 2M would give you a heap looking like:

[head->[10,10]->[980]->[1000]->[998000]->[1000000]->tail]

As more numbers are inserted, blocks combine until you end up with one contiguous block.
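For the insertion example above, the stored deltas come out as follows (a sketch of just the delta encoding, not the interpolated block allocator itself):

```python
def to_deltas(sorted_values):
    """Encode a sorted list as gaps from the previous value (first gap from 0)."""
    prev, out = 0, []
    for v in sorted_values:
        out.append(v - prev)
        prev = v
    return out

print(to_deltas([10, 20, 1000, 2000, 1_000_000, 2_000_000]))
# [10, 10, 980, 1000, 998000, 1000000]
```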

[+] jedberg|13 years ago|reply
Am I the only one who thinks this was someone's homework assignment?
[+] kevinburke|13 years ago|reply
It does sound similar to the opening problem in the book "Programming Pearls". The solution proposed there was to write temporary files to disk and do a K-way merge sort.
[+] mmaunder|13 years ago|reply
Compressed buckets of number ranges. I love it because it's incredibly easy to understand which makes the code easy to maintain, and it works.
[+] carl8|13 years ago|reply
Here's the solution that I came up with prior to following the link or reading any comments.

We're given 1 million integers from 0 to 99,999,999. We only have 1MB of RAM, or an average ~8 bits for each of the million numbers. So we can't store the numbers directly since they take ~27 bits each.

First thought was to use a bitset but that would require 100 million bits, and we only have ~8 million bits RAM, so that's not going to work. Also need to deal with duplicates.

How about this: something similar to an insertion sort algorithm that stores deltas of the distances between sorted numbers. As a number is streamed in, we start scanning from the beginning of the list until its correct position is found, where it is inserted, and then we push down the remaining numbers. This will be O(n^2).

Since the average delta distance between numbers is about 100, we'll use 8 bits to store the delta value. Value 0 means the number is a duplicate of the current number. Values 1-254 mean add that value to the current number to get the new number. Value 255 means add 255, then use the next byte as the delta value (repeat until the value != 255).

(Case 1) 1 million ints exactly 100 apart: 0, 100, 200, 300, 400, ..., 99999800, 99999900 Stored as a list of 8-bit delta values: 0, 100, 100, 100, 100, ..., 100, 100 (1 million bytes total)

(Case 2) 1/2 million zeros, then 1/2 million values of 99999999. Stored as: start: 0, 0, 0, 0, ... (1/2 million zero deltas) then: 255, 255, 255, 255, ... (floor(99999999 / 255) = 392,156 bytes of 255, which gets us to 99,999,780) then: 219 (one final byte, reaching 99,999,999) then: 0, 0, 0, 0, ... (499,999 more zero deltas for the remaining duplicates)

So the total amount of storage for Case 2, which I presume is the worst case (but correct me if I'm wrong!), is: 500,000 + 392,156 + 1 + 499,999 = 1,392,156 bytes to store the delta values.

1MB = 1,048,576 bytes, so I'm over by 343,580 bytes... (so close!)

We'll have to modify this scheme to reduce the number of 255 values, which should not be hard to do and will get us under the 1MB size limit. Or we could try something fancier, like Huffman coding, to reduce the size of the delta values.
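A sketch of this byte-delta scheme; note the chain of 255s encodes a single large delta, so the trailing duplicates contribute 499,999 further zero bytes:

```python
def encode_deltas(sorted_values):
    """One byte per delta, with 255 acting as an 'add 255 and continue' byte."""
    prev, out = 0, bytearray()
    for v in sorted_values:
        d = v - prev
        while d >= 255:        # 255 means "add 255, then read the next byte"
            out.append(255)
            d -= 255
        out.append(d)          # 0 encodes a duplicate of the previous value
        prev = v
    return out

case2 = [0] * 500_000 + [99_999_999] * 500_000
print(len(encode_deltas(case2)))   # 1,392,156 bytes: over the 1,048,576 budget
```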

[+] teeja|13 years ago|reply
Compressed bucket-ranges work fine for random data, but if the data are skewed in some way (e.g. normal-curve shaped), the bucket sizes must be adjustable.

An alternative method is to use one bucket into which all values below a limit (e.g. 20,000,000) are sorted as they arrive, and compress all the rest. When 10^6 values have arrived, transmit the bucket and then reuse the empty bucket repeatedly to sort the compressed values.

[+] Zenst|13 years ago|reply
Just a thought but given "have a computer with 1M of RAM and no other local storage. I must use it to accept 1 million 8-digit decimal numbers over a TCP connection, "

I was wondering why you can't use TCP as a form of storage; there are possibly many ways, but latency and buffers would actually work for you as crude storage. Not that it is needed in this case, but it is one form of queue that could be abused to store data.

[+] recon517|13 years ago|reply
Store each number as a delta from the previous one, the first number as a delta from 0. Each number starts with a code describing its length: 0 (duplicate), 7, 9 or 27 bits. For storing a code use the values '1', '01', '001' and '0001'. Calculate statistics for the different code types, and when space is tight, replace the most common codes with the shortest ones.
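A quick sketch of the bit-cost table for the initial code assignment (payload widths 0/7/9/27 bits as given above; the adaptive reassignment of codes is left out):

```python
def code_bits(delta):
    """Bit cost of one delta: variable-length prefix plus fixed-width payload."""
    if delta == 0:
        return 1            # '1': duplicate, no payload
    if delta < 2**7:
        return 2 + 7        # '01' + 7-bit delta
    if delta < 2**9:
        return 3 + 9        # '001' + 9-bit delta
    return 4 + 27           # '0001' + 27-bit delta (covers the full 10^8 range)

print(code_bits(0), code_bits(100), code_bits(500), code_bits(10**7))
# 1 9 12 31
```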