item 23113656

Show HN: Radix sort big files in memory

36 points | afiodorov | 5 years ago | github.com

32 comments

antender | 5 years ago
Was there any external requirement to sort this file in memory only? Why not just split the file into chunks (around 100 MB each), sort them as usual, and then k-way merge the sorted chunks? In theory this can be faster than allocating lots of RAM for a radix tree, especially on an SSD rather than an HDD.
antender | 5 years ago
After consulting the GNU sort manual: sort has a -m option for exactly this case of merging presorted files, so you can test this with 'split -l', then 'xargs sort' (to parallelize), then 'sort -m' to merge the chunks.
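The suggested pipeline can be sketched with GNU coreutils ("big.csv" and the chunk size are placeholders, not anything from the project):

```shell
# Sketch of the split / parallel-sort / merge idea:
split -l 1000000 big.csv chunk_              # fixed-size pieces: chunk_aa, chunk_ab, ...
ls chunk_* | xargs -P 8 -I{} sort {} -o {}   # sort each chunk in place, 8 in parallel
sort -m -o sorted.csv chunk_*                # k-way merge of the presorted chunks
rm chunk_*                                   # clean up the intermediate pieces
```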
BubRoss | 5 years ago
I'm not sure what the point of this is; it's something basic, implemented poorly.

- Saving the start and end position of a string that represents a date with 16 bytes is silly. Just convert the date to 64 or 32 bit integer (or even less depending on the granularity and range of the dates).

- Run through the file converting the dates to a smaller integer representation. When the array of integers gets too big, sort it and use it to write a sorted text file chunk.

- Once that is done, merge the text file chunks together.

Any good sorting algorithm should be able to do this in under 20 minutes with 16 cores. If IO is a bottleneck it would be much worse while trying to swap around text lines in a giant memory mapped file.
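The decorate-sort-undecorate version of this suggestion can be sketched with gawk ("log.csv" and the ISO-8601 first field are assumptions, not the tool's actual input format; mktime is a gawk extension):

```shell
# Prefix each line with the timestamp as epoch seconds, sort numerically,
# then strip the key again:
gawk -F, '{ ts = $1; gsub(/[-T:]/, " ", ts); sub(/Z/, "", ts)
            print mktime(ts) "\t" $0 }' log.csv \
  | sort -n \
  | cut -f2- > sorted.csv
```

mktime uses the local time zone, but that offset is the same for every line, so the ordering is unaffected.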

afiodorov | 5 years ago
> If IO is a bottleneck it would be much worse while trying to swap around text lines in a giant memory mapped file.

There's no swapping of lines in a giant memory-mapped file; the byte array of the file remains intact.

The program only swaps the line markers.

> Just convert the date to 64 or 32 bit integer

This would require extra memory for each line, complicate the user interface (data formats vary a lot), and make the utility less generic, since you don't always want to sort chronologically. The current design is to sort using as little extra memory as possible, because when files are this big, fitting them into RAM is already a challenge.

---

Finally, there's no evidence that the proposed solution would be faster. Empirically I have found that sorting chunks and merging them is considerably slower than sorting in memory, which makes sense given how much faster RAM is than an HDD.

jasonwatkinspdx | 5 years ago
> Just convert the date to 64 or 32 bit integer (or even less depending on the granularity and range of the dates).

We use RFC 3339 because there are a lot of gotchas when you conflate civil timekeeping with an absolute time scale. The number of seconds in a day is not constant, and is generally only known a few months ahead of a coming leap second. Civil timekeeping syntaxes can deal with this gracefully precisely because they're a structured representation rather than absolute nanos relative to some epoch.

afiodorov | 5 years ago
> - Saving the start and end position of a string that represents a date with 16 bytes is silly.

I am not saving the start and end positions of a string that represents a date; I am saving the start and end positions of each line, since I need to split the loaded file into lines. This requires extra memory. I tried sorting without caching the end position of each line (always reading from the start until \n), but sorting was slower that way.

jrockway | 5 years ago
I guess what surprises me about this is how badly "sort" did.
easytiger | 5 years ago
One doesn't always control the input format of the data one receives. Indeed, a lot of what I've cobbled together in real work is munging many and varied logs.
TheTank | 5 years ago
Thanks for sharing. Do I understand correctly that this requires loading the whole file into memory along with an ordered list of keys? Or are just the first n bytes loaded into memory? If the former, then it seems very expensive in terms of RAM, particularly if your data file has multiple columns.

An alternative I've used is to load the file into a database, sort by the key I want (which only loads the key into memory) and then output the result to a file. It does go through disk, but you can handle larger files since you only need the key in memory, not the whole file.
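The database route can be sketched with the sqlite3 CLI ("data.csv" with a header line and a "timestamp" column are assumptions for illustration):

```shell
# Import the CSV into an on-disk table, then let SQLite do the sort:
sqlite3 scratch.db \
  ".mode csv" \
  ".import data.csv t" \
  ".headers on" \
  ".once sorted.csv" \
  "SELECT * FROM t ORDER BY timestamp;"
rm scratch.db
```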

afiodorov | 5 years ago
Indeed I load the entire file, but provided there's enough RAM to load the file with all its columns, I think loading everything should give optimal performance. I agree with your approach when there isn't enough RAM, though.
jepcommenter | 5 years ago
Since you sort by the first field anyway, could you please try omitting the field split (-t, -k1)? For me it gives a noticeable improvement:

    $ stat --printf="%s\n" p.csv
    1258291200

    $ time sort -t, -k1 -S100% -o sorted.csv p.csv
    real 0m50,186s  user 4m6,962s  sys 0m4,562s

    $ time sort -o sorted.csv p.csv
    real 0m43,483s  user 3m36,473s  sys 0m4,282s

jepcommenter | 5 years ago
Full-line comparison would probably use memcmp, which bails out on the first non-matching character, while the field-splitting overhead might be significant.
afiodorov | 5 years ago
The exact same file has been taking more than 35 minutes to sort already, so it's slower without splitting.

Edit: it finished!

    real 35m28.370s
    user 40m17.129s
    sys  4m31.081s

loeg | 5 years ago
Where did you find the dataset, or did you construct your own?
known | 5 years ago
What would be the result when we add these sort options?

    sort -f -s --batch-size=1024 -T/home
afiodorov | 5 years ago
Could you elaborate on why this could be faster? These experiments take some time to complete. Also, should I not use the --parallel flag?
helltone | 5 years ago
Out of curiosity, what is the entropy of your input? I.e. what's the output of

    gzip input.csv -c | wc -c

versus

    cat input.csv | wc -l

?
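The two numbers above can be combined into a single per-line figure ("input.csv" is a placeholder name):

```shell
# Rough entropy proxy: compressed bytes per input line.
bytes=$(gzip -c input.csv | wc -c)
lines=$(wc -l < input.csv)
echo "$((bytes / lines)) compressed bytes per line"
```

Highly repetitive inputs compress to only a few bytes per line, which also makes them easier cases for comparison-based sorts.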