Ask HN: Sorting massive text files?
44 points | JBerlinsky | 15 years ago
[root@server src]# time cat FILE | wc -l
2608560847

real    11m18.148s
user    1m35.667s
sys     1m33.820s
Any suggestions on how I can go about getting unique records from this type of file?
timr | 15 years ago
http://vkundeti.blogspot.com/2008/03/tech-algorithmic-detail...
Thus, sort -u <filename> is your go-to for simple jobs. (Note that you'll need to have enough extra disk space to hold all of the temporary files.)
If you need to do something more sophisticated (e.g. joining lines from a web server log into sessions, then sorting sessions), you can still use divide-and-conquer, but you have to be smarter. Divide the file into N parts based on some logic (e.g. one file per session ID), then sort each part individually, then merge the results back together.
This is what map/reduce frameworks are made to do, of course, but something like Hadoop may be overkill unless you plan to do this type of thing often.
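That chunk-sort-merge pattern can be sketched in Python (a toy sketch: `external_unique_sort`, the chunk size, and the temp-file handling are all illustrative, not what sort(1) actually does internally):

```python
import heapq
import itertools
import tempfile

def external_unique_sort(lines, chunk_size=1_000_000):
    """Divide and conquer: sort fixed-size chunks, spill each sorted
    run to a temp file, then k-way merge the runs, dropping duplicates."""
    lines = iter(lines)
    runs = []
    while True:
        chunk = sorted(set(itertools.islice(lines, chunk_size)))
        if not chunk:
            break
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(line + "\n" for line in chunk)
        f.seek(0)
        runs.append(f)
    # heapq.merge streams the sorted runs without loading them fully;
    # groupby collapses duplicates that appear in different runs.
    merged = heapq.merge(*[(line.rstrip("\n") for line in f) for f in runs])
    return [line for line, _ in itertools.groupby(merged)]

print(external_unique_sort(["b", "a", "c", "a", "b", "d"], chunk_size=2))
# → ['a', 'b', 'c', 'd']
```

With chunk_size lines per run, memory stays bounded regardless of file size, which is the same trick sort(1) uses with its temporary files.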
arnabdotorg | 15 years ago
Here's a blog post I wrote about it:
http://arnab.org/blog/quick-and-easy-multicore-sort
lsb | 15 years ago
1) Do sort -u --buffer-size=60G FILE. You'll sort it all in memory, and that'll be a great speedup.
If you have the money and your dataset fits in memory, it's easier to scale up than to scale out, so don't bother with Hadoop for something this simple. What do you want to do after you get the uniques?
brown9-2 | 15 years ago
You could use the same approach to simply take the unique tokens of the output of the final "count".
This may be overkill if you can read the file in less than an hour, but the divide-and-conquer approach may be a good inspiration.
ralph | 15 years ago
Similarly, don't use uniq(1) if sort's -u option will suffice: you're forcing 35GB of data to be written down a pipe to uniq(1) when it's quite possible that `sort -u` would produce 100MB or so.
uuoc | 15 years ago
Both can be rewritten without the cat, as "sort FILE | uniq" and "time wc -l FILE".
And given the data size, both cat-free versions would likely be faster as well.
nagrom | 15 years ago
Of course, if you want to learn Hadoop then go for it. But it's probably more practical just to let it run :-)
davidst | 15 years ago
How much memory do you have available?
The Bloom filter is a good way to go. Here is one possibility:
If you have 2 billion unique lines here's how much memory you need for a Bloom filter:
Error rate       Memory
---------------  --------
1 / thousand      3.35 GB
1 / million       6.70 GB
1 / billion      10.05 GB
1 / 10 billion   11.16 GB
1 / 100 billion  12.28 GB
If you have 12 GB available you can reduce the chance of missing a unique string down to almost zero.
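Those figures follow from the standard sizing formula for an optimal Bloom filter, m = -n·ln(p)/(ln 2)² bits. A quick check (my arithmetic, assuming the GB figures in the table are really GiB):

```python
import math

def bloom_gib(n, p):
    """GiB needed for an optimally sized Bloom filter holding n items
    at false-positive rate p: m = -n * ln(p) / (ln 2)^2 bits."""
    bits = -n * math.log(p) / math.log(2) ** 2
    return bits / 8 / 2**30

n = 2_000_000_000  # 2 billion unique lines
for label, p in [("1 / thousand", 1e-3), ("1 / million", 1e-6),
                 ("1 / billion", 1e-9), ("1 / 10 billion", 1e-10),
                 ("1 / 100 billion", 1e-11)]:
    print(f"{label:<16} {bloom_gib(n, p):6.2f} GiB")
```

The output matches the table above to within about 0.01 GB of rounding.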
davidst | 15 years ago
http://en.wikipedia.org/wiki/Cuckoo_hashing http://www.ru.is/faculty/ulfar/CuckooHash.pdf
You can get your hash table utilization to almost 100%. If you're willing to accept a 64-bit hash function and you have 2 billion unique strings you'll need just over 16GB to do it all in one pass.
A 64-bit hash is just barely enough for 2 billion items. You might get a handful of collisions. Add more bits to reduce the probability: 128 bits would double the required size to 32GB, but you're unlikely to ever see a collision in your lifetime.
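The collision risk can be estimated with the birthday approximation (my arithmetic, not from the comment): the expected number of colliding pairs among n items under a uniform b-bit hash is about n(n-1)/2^(b+1):

```python
def expected_collisions(n, bits):
    """Birthday approximation: expected number of colliding pairs
    among n items under a uniform `bits`-bit hash, n*(n-1) / 2^(bits+1)."""
    return n * (n - 1) / 2 / 2**bits

n = 2_000_000_000
print(f"{expected_collisions(n, 64):.3f}")   # roughly 0.1 at 64 bits
print(f"{expected_collisions(n, 128):.1e}")  # effectively zero at 128 bits
```

At 64 bits the expectation is only about 0.1, so you would most likely see no collisions at all; but a single collision silently merges two distinct lines, which is why padding to 128 bits is attractive.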
If 16GB or 32GB is too much you could do it in multiple passes. Make N passes through the text file and adjust the range of hash values you accept to test only 1/N each time.
8 passes through the original file with a generous 128-bit hash would take only 4GB of RAM.
I have the feeling this is more work than you would want to put into the problem, though.
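A sketch of the multi-pass variant in Python. The 8-byte BLAKE2b digest stands in for whatever 64-bit hash you'd use, and a Python set of ints has far more per-entry overhead than the 8 bytes budgeted above, so treat this as illustrating the range-partitioning idea, not the memory math:

```python
import hashlib

def hash64(line):
    # 8-byte BLAKE2b digest interpreted as an unsigned 64-bit integer
    d = hashlib.blake2b(line.encode(), digest_size=8).digest()
    return int.from_bytes(d, "big")

def count_unique(lines, n_passes=8):
    """Scan the input n_passes times; pass p keeps only hashes whose
    value falls in the p-th slice of [0, 2^64), so each pass holds
    roughly 1/n_passes of the unique hashes in memory."""
    total = 0
    for p in range(n_passes):
        seen = set()
        for line in lines:  # must be re-iterable: a list, or reopen the file each pass
            h = hash64(line)
            if h * n_passes // 2**64 == p:
                seen.add(h)
        total += len(seen)
    return total

print(count_unique(["apple", "banana", "apple", "cherry"] * 3, n_passes=4))
# → 3
```

Each distinct hash lands in exactly one slice, so the per-pass counts sum to the total number of distinct hashes (and, collisions aside, distinct lines).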
aw3c2 | 15 years ago
Your 'cat' is not needed, by the way; sort takes a filename as an argument. And sort has a '-u' option!
It would be interesting to know whether sort fails on this (and why), or how long it takes.
edit: Why on earth are you doing this as root?
meterplech | 15 years ago
http://en.wikipedia.org/wiki/K_(programming_language)
secretasiandan | 15 years ago
Also, what do you want to do with the unique records? That might affect which initial processing method is best for your goal.
Tichy | 15 years ago
Maybe a kind of divide and conquer could work? Split into several files, do the sort | uniq on each of them. Then merge them, checking for duplicates on the way. I think merging should be almost as fast as line counting, at least linear in the size of the two files.
Edit: I guess it would be slower than counting, because presumably it would write to a new file (the merged file). But still, it should be linear.
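The merge-with-duplicate-check step can be sketched as a single linear pass over two sorted, already-deduplicated inputs (the function name is illustrative):

```python
def merge_unique(a, b):
    """Merge two sorted sequences into one sorted list, dropping
    duplicates, in one linear pass."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            x = a[i]; i += 1
        elif b[j] < a[i]:
            x = b[j]; j += 1
        else:              # equal: take the value once, advance both
            x = a[i]; i += 1; j += 1
        if not out or out[-1] != x:
            out.append(x)
    for x in a[i:] + b[j:]:  # drain whichever input has leftovers
        if not out or out[-1] != x:
            out.append(x)
    return out

print(merge_unique(["a", "c", "d"], ["b", "c", "e"]))
# → ['a', 'b', 'c', 'd', 'e']
```

As the comment says, this is linear in the combined size of the two inputs; the extra cost over counting is only the writing of the merged output.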