item 1603381

Ask HN: Sorting massive text files?

44 points| JBerlinsky | 15 years ago | reply

I've got a ~35GB text file full of data, and I want to parse it so I only have unique results in the end file. In the past, I've had no problem with cat FILE | sort | uniq, but I've never worked with anything of this magnitude. Even running a line count takes an extraordinary amount of time:

    [root@server src]# time cat FILE | wc -l
    2608560847

    real    11m18.148s
    user    1m35.667s
    sys     1m33.820s

Any suggestions on how I can go about getting unique records from this type of file?

68 comments

[+] timr|15 years ago|reply
Unix sort already does a merge sort (search for it), which is your default solution if you just want to sort individual lines based on a field:

http://vkundeti.blogspot.com/2008/03/tech-algorithmic-detail...

Thus, sort -u <filename> is your go-to for simple jobs. (Note that you'll need to have enough extra disk space to hold all of the temporary files.)

If you need to do something more sophisticated (e.g. joining lines from a web server log into sessions, then sorting sessions), you can still use divide-and-conquer, but you have to be smarter. Divide the file into N parts based on some logic (i.e. divide into files based on session ID), then sort the lines in each individually, then merge the results back together.

This is what map/reduce frameworks are made to do, of course, but something like Hadoop may be overkill unless you plan to do this type of thing often.

[+] ralph|15 years ago|reply
GNU sort has a --compress-program option that can help if you're really tight on scratch space, e.g. lzop(1). The -T option or the TMPDIR environment variable lets you point sort at a filesystem with more free space.
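A toy demonstration of both flags (a tiny file and paths under /tmp stand in for the real 35GB file and a roomy scratch filesystem; gzip stands in for lzop in case it isn't installed):

    printf 'b\na\nb\nc\na\n' > /tmp/sample.txt
    # -T points sort's temp files at a filesystem with free space;
    # --compress-program (GNU sort) compresses those temp files.
    mkdir -p /tmp/sort-scratch
    sort -u -T /tmp/sort-scratch --compress-program=gzip /tmp/sample.txt > /tmp/sample.uniq
    cat /tmp/sample.uniq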
[+] jbl|15 years ago|reply
This is the approach that I use except that I generate a Makefile which has:

  - a target to split the input into chunks
  - targets to sort each of the split chunks
  - a target to merge the sorted chunks using "sort -m"
Then, if you are using GNU make, it's just a matter of using the -j flag to control how many chunks are sorted in parallel.
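A self-contained toy version of the same split / sort-in-parallel / merge pipeline, using shell background jobs instead of make -j (a tiny file stands in for the 35GB one, and 1k chunks for the 1G ones):

    mkdir -p /tmp/sortdemo && cd /tmp/sortdemo
    rm -f FILE split.FILE.* sort.FILE.* FILE.uniq
    printf 'd\nb\nd\na\nc\nb\n' > FILE
    split -d -C 1k FILE split.FILE.          # split on line boundaries
    for f in split.FILE.*; do
      sort -u "$f" -o "sort.${f#split.}" &   # one background sort per chunk
    done
    wait                                     # all chunks sort in parallel
    sort -m -u sort.FILE.* -o FILE.uniq
    cat FILE.uniq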
[+] lsb|15 years ago|reply
0) Get your data into Amazon's cloud, and rent a server on EC2 with 68GB of memory. Spot instances are a buck an hour.

1) Do sort -u --buffer-size=60GB FILE. You'll sort it all in memory, and that'll be a great speedup.

It's easier to scale up than scale out if you have the money, and your dataset fits in memory, so don't bother with Hadoop for something as simple as that. What do you want to do after you get the uniques?

[+] trin_|15 years ago|reply
But wouldn't that also involve uploading 35GB of data and finally downloading xxGB (the result size)? I think when you add this time, you can just take your multicore desktop and let it run overnight.
[+] brown9-2|15 years ago|reply
The canonical Hadoop example is counting the occurrences of each word in a large text file: http://wiki.apache.org/hadoop/WordCount

You could use the same approach to simply take the unique tokens of the output of the final "count".

May be overkill if you can read the file in less than an hour, but this approach (divide and conquer) may be a good inspiration.

[+] naz|15 years ago|reply

   sort -u <filename>
would be faster than all that piping
[+] secretasiandan|15 years ago|reply
I've never used it but there's a -S flag for sort that allows you to specify buffer size. So maybe look into increasing that as large as you can.
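A toy example of the -S flag (GNU sort accepts fixed sizes like 2G or percentages like 50%); with a buffer larger than the input, the whole sort happens in one in-memory pass with no temp files:

    printf 'pear\napple\npear\nfig\n' > /tmp/fruits.txt
    # -S caps the main-memory buffer sort is allowed to use
    sort -u -S 50% /tmp/fruits.txt > /tmp/fruits.uniq
    cat /tmp/fruits.uniq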
[+] jemfinch|15 years ago|reply
Compared to the IO costs, the computational difference must be inconsequential.
[+] ralph|15 years ago|reply
If you're going to process this file a lot, and it's straightforward text, then consider compressing it with lzop(1). lzop is fast at compressing, and very fast at decompressing, and the overhead of having to decompress on every read through may well be much less than the time saved by not waiting for the disk.

    lzop -9v $file
    lzop -dc $file.lzo | sort ...
You shouldn't use cat(1) if you don't need to. You're removing the ability of wc(1) to do faster things than read(2) from a pipe, e.g. mmap(2) the file, which is probably what cat is doing.

Similarly don't use uniq(1) if sort's -u option will suffice as you're forcing 35GB of data to be written down a pipe to uniq(1) when it's quite possible that `sort -u' would produce 100MB or so.

[+] pbh|15 years ago|reply
Personally, I would run:

  split -d -C 1G FILE split.FILE # separate into 1G files on line boundaries
  # then, on separate cores
  sort -u split.FILE.00 -o sort.FILE.00
  sort -u split.FILE.01 -o sort.FILE.01
  ...
  sort -u split.FILE.N -o sort.FILE.N
  # then, on one core
  sort -m -u sort.FILE.*
You should be aware that you may not get what you expect from sort unless you set LC_ALL="C". You should also pass "-S" to sort to set the main memory size, probably to a value like "1G". Of course, at some point when you're doing a lot of distributed processing you'll just want to use Hadoop, as other posters have noted.
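A small demonstration of why LC_ALL=C matters: in the C locale, sort compares raw bytes rather than using locale-aware collation (which is also much faster), so every uppercase letter sorts before any lowercase one:

    printf 'b\nA\na\nB\n' > /tmp/mixed.txt
    # byte-order comparison: 'A' (65) and 'B' (66) come before 'a' (97) and 'b' (98)
    LC_ALL=C sort /tmp/mixed.txt > /tmp/mixed.sorted
    cat /tmp/mixed.sorted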
[+] pjscott|15 years ago|reply
It looks like GNU sort already does this internally, but with smarter temp file management. Increasing the memory size looks like a good idea, and if you have multiple hard drives, it may help to specify a temporary directory on a different hard drive with the -T option, since this problem looks IO-bound.
[+] JoachimSchipper|15 years ago|reply
Your example suggests that you are unhappy with the time it takes to read data from disk (the above is clearly I/O-bound). There are a zillion ways to get unique records from the file - the frequently-mentioned sort -u is one - but there's no way to do it without reading all of the file...
[+] uuoc|15 years ago|reply
You also gain a "Useless use of cat" award for both of your examples: http://partmaps.org/era/unix/award.html

Both can be rewritten as "sort FILE | uniq" and "time wc -l FILE".

And given the data size, both non-cat uses would likely be faster as well.

[+] pixelbeat|15 years ago|reply
We've made a multicore update to GNU sort recently, so with the latest from git one can just do "sort -u". That will chunk, compress, and merge automatically. Also run with LC_ALL=C if that's appropriate, as it will greatly speed things up. Also see the -S option.
[+] nagrom|15 years ago|reply
If this is just a one off job, I would guess that running sort -u $FILE will take less time than writing some specific software for the purpose.

Of course, if you want to learn hadoop then go for it. But it's probably more practical just to let it run :-)

[+] davidst|15 years ago|reply
How important is it that you get every unique line? Is it ok for some small percentage of unique lines to not appear in the output?

How much memory do you have available?

The Bloom filter is a good way to go. Here is one possibility:

If you have 2 billion unique lines here's how much memory you need for a Bloom filter:

    Error rate       Memory
    ---------------  --------
    1 / thousand      3.35 GB
    1 / million       6.70 GB
    1 / billion      10.05 GB
    1 / 10 billion   11.16 GB
    1 / 100 billion  12.28 GB

If you have 12 GB available you can reduce the chance of missing a unique string down to almost zero.
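The sizing rule behind numbers like those: an optimally-configured Bloom filter needs about m = -n * ln(p) / (ln 2)^2 bits for n items at false-positive rate p. A quick awk check reproduces the "1 / thousand" row for n = 2 billion:

    awk 'BEGIN {
      n = 2e9; p = 1e-3
      bits = -n * log(p) / (log(2)^2)          # optimal filter size in bits
      printf "%.2f GB\n", bits / 8 / 2^30      # convert to GB
    }' > /tmp/bloom.txt
    cat /tmp/bloom.txt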

[+] davidst|15 years ago|reply
BTW, as you decrease your error rate, Cuckoo hashing starts to become an attractive alternative to Bloom filters:

http://en.wikipedia.org/wiki/Cuckoo_hashing

http://www.ru.is/faculty/ulfar/CuckooHash.pdf

You can get your hash table utilization to almost 100%. If you're willing to accept a 64-bit hash function and you have 2 billion unique strings you'll need just over 16GB to do it all in one pass.

A 64-bit hash is just barely enough for 2 billion items. You might get a handful of collisions. Add more bits to reduce the probability. 128 bits would double the required size to 32GB, but you're unlikely to ever see a collision in your lifetime.

If 16GB or 32GB is too much you could do it in multiple passes. Make N passes through the text file and adjust the range of hash values you accept to test only 1/N each time.

8 passes through the original file with a generous 128-bit hash would take only 4GB of RAM.

I have the feeling this is more work than you would want to put into the problem, though.
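A toy sketch of the multi-pass idea (cksum's CRC stands in for a real hash function, its low bit for the hash range, and N = 2 passes): each pass keeps only the lines whose hash lands in the current slice, dedupes that slice, and appends the survivors.

    printf 'x\ny\nx\nz\ny\n' > /tmp/big.txt
    : > /tmp/uniq.txt
    for pass in 0 1; do
      while IFS= read -r line; do
        h=$(printf '%s' "$line" | cksum | cut -d' ' -f1)
        # keep only lines whose hash falls in this pass's slice
        [ $((h % 2)) -eq "$pass" ] && printf '%s\n' "$line"
      done < /tmp/big.txt | sort -u >> /tmp/uniq.txt
    done
    sort /tmp/uniq.txt > /tmp/uniq.sorted
    cat /tmp/uniq.sorted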

[+] andymoe|15 years ago|reply
Shove your data into PostgreSQL or MySQL with one of their bulk import utilities and let the db handle the hard work of indexing and filtering. That's what they are designed to do. You can then slice and dice the data to your heart's content. Make sure you have plenty of disk space, as indexing your data will take up a good amount of space, as will the overhead of storing it in the db in the first place.
[+] jey|15 years ago|reply
That's a very roundabout and slow way to do it if he never uses the DB for anything else. Building those B-trees will be way more expensive than a merge sort.
[+] aw3c2|15 years ago|reply
I guess a line count has to run through the whole file, so obviously your storage is the bottleneck. If your storage managed a constant 100MB/s, the whole 35GB would take about 6 minutes.

Your 'cat' is not needed by the way, sort takes a filename as argument. And sort can '-u'!

It would be interesting to know whether sort fails on this (and if so, why), or how long it takes.

edit: Why on earth are you doing this as root?

[+] meterplech|15 years ago|reply
If you have access to K (or you can use the free J), this is exactly the type of problem they can solve. If you put the file into a list x, ?x is the only command you need to get a list of uniques.

http://en.wikipedia.org/wiki/K_(programming_language)

[+] Groxx|15 years ago|reply
Does "put[ting] the file into a list" mean "loading the file into memory"?
[+] secretasiandan|15 years ago|reply
Do you know what percent of records you expect to be unique?

Also, what do you want to do with the unique records? That might affect which initial processing method is best for your goal.

[+] Tichy|15 years ago|reply
Never did something like it, but can't resist thinking about it.

Maybe a kind of divide and conquer could work? Split into several files, do the sort | uniq on each of them. Then merge them, checking for duplicates on the way. I think merging should be almost as fast as line counting, at least linear in the size of the two files.

Edit: I guess it would be slower than counting, because presumably it would write to a new file (the merged file). But still, it should be linear.