top | item 8908462

Command-line tools can be faster than your Hadoop cluster

812 points | wglb | 11 years ago | aadrake.com

309 comments

[+] danso|11 years ago|reply
I'm becoming a stronger and stronger advocate of teaching command-line interfaces to even programmers at the novice level...it's easier in many ways to think of how data is being worked on by "filters" and "pipes"...and more importantly, every time you try a step, something happens...making it much easier to interactively iterate through a process.

That it also happens to be very fast and powerful (when memory isn't a limiting factor) is nice icing on the cake. I moved over to doing much more on the CLI after realizing that doing something as simple as "head -n 1 massive.csv" to inspect the headers of corrupt multi-GB CSV files made my data-munging life substantially more enjoyable than opening them up in Sublime Text.
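
The iterate-one-step-at-a-time style described above might look like this in practice (massive.csv is a stand-in name for some big comma-separated file):

```shell
# build the pipeline one stage at a time, inspecting output after each step
head -n 1 massive.csv                          # peek at the header row
cut -d, -f2 massive.csv | head                 # pull out the second column
cut -d, -f2 massive.csv | sort | uniq -c | sort -rn | head   # top values in it
```

Each stage produces visible output immediately, so you can sanity-check before bolting on the next filter.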

[+] Spooky23|11 years ago|reply
A few years ago between projects, my coworkers cooked up some satirical amazing Web 2.0 data science tools. They used git, did a screencast and distributed it internally.

It was basically a few compiled perl scripts and some obfuscated shell scripts with a layer of glitz. People actually used it and LOVED it... It was supposedly better than the real tools some groups were using.

It was one of the more epic work trolls I've ever seen!

[+] hanoz|11 years ago|reply
Your CSV peeking epiphany was in essence a matter of code vs. tools though rather than necessarily CLI vs. GUI. On Windows you might just as well have discovered you could fire up Linqpad and enter File.ReadLines("massive.csv").First() for example.
[+] crcsmnky|11 years ago|reply
Perhaps I'm missing something. It appears that the author is recommending against using Hadoop (and related tools) for processing 3.5GB of data. Who in the world thought that would be a good idea to begin with?

The underlying problem here isn't unique to Hadoop. People who are minimally familiar with how technology works and who are very much into BuzzWords™ will always throw around the wrong tool for the job so they can sound intelligent with a certain segment of the population.

That said, I like seeing how people put together their own CLI-based processing pipelines.

[+] x0x0|11 years ago|reply
I've used hadoop at petabyte scale (2+pb input; 10+pb sorted for the job) for machine learning tasks. If you have such a thing on your resume, you will be inundated with employers who have "big data", and at least half will be under 50g with a good chunk of those under 10g. You'll also see multiple (shitty) 16 machine clusters, any of which -- for any task -- could be destroyed by code running on a single decent server with ssds. Let alone hadoop jobs running in emr, which is glacially slow (slow disk, slow network, slow everything.)

Also, hadoop is so painfully slow to develop in that it's practically a full employment act for software engineers. I imagine it's similar to early ejb coding.

[+] EpicEng|11 years ago|reply
Exactly this just happened where I work. The CIO was recommending Hadoop on AWS for our image processing/analysis jobs. We process a single set of images at a time which come in around ~1.5GB. The output data size is about 1.2GB. Not a good candidate for Hadoop but, you know... "big data", right?
[+] theVirginian|11 years ago|reply
"Although Tom was doing the project for fun, often people use Hadoop and other so-called Big Data (tm) tools for real-world processing and analysis jobs that can be done faster with simpler tools and different techniques."

I think the point the author is making is that although they knew from the start that Hadoop wasn't necessary for the job, many people probably don't.

[+] sputknick|11 years ago|reply
Lots of people think that is "big data". For most people, if it's too big for an Excel spreadsheet, it's "big data", and the way you process big data is with Hadoop. Of course, once you show them the billable-hours difference between setting up a Hadoop cluster and (in my case at least) using Python libraries on a MBP, they change their minds real fast. It's just a matter of "big data" being a new thing; people will figure it out as time goes on and things settle down.
[+] mrweasel|11 years ago|reply
People love the idea of being big and having "big problems". Wanting to use Hadoop isn't that different from wanting to use all sorts of fancy "web-scale" databases.

Most of us don't have scaling issues or big data, but that sort of excludes us from using all the fancy new tools that we want to play with. I'm still convinced that most of the stuff I work on at work could be run on SQLite, if designed a bit more carefully.

The truth is that most of us will never do anything that couldn't be solved with 10-year-old technology. And honestly, we should be happy; there's a certain comfort in being able to use simple and generally understood tools.

[+] aadrake|11 years ago|reply
Hi all, original author here.

Some have questioned why I would spend the time advocating against the use of Hadoop for such small data processing tasks as that's clearly not when it should be used anyway. Sadly, Big Data (tm) frameworks are often recommended, required, or used more often than they should be. I know to many of us it seems crazy, but it's true. The worst I've seen was Hadoop used for a processing task of less than 1MB. Seriously.

Also, much agreement with those saying there should be more education effort when it comes to teaching command line tools. O'Reilly even has a book out on the topic: http://shop.oreilly.com/product/0636920032823.do

Thank you for all the comments and support.

[+] jeroenjanssens|11 years ago|reply
Author of Data Science at the Command Line here. Thanks for the nice blog post and for mentioning my book here. While we're talking about the subject of education, allow me to shamelessly promote a two-day workshop that I'll be giving next month in London: http://datascienceatthecommandline.com/#workshop
[+] andyjpb|11 years ago|reply
This is a great article and a fun read. A friend sent it over to me and I wrote him some notes about my thoughts. Having since realised it's on HN, I thought I'd post them here as well.

Some of my wording is a bit terse; sorry! :-) The article is great and I really enjoyed it. He's certainly got the right solution for the particular task at hand (which I think is his entire point) but he's generally right for the wrong reasons so I pick a few holes in that: I'm not trying to be mean. :-)

----- Classic system sizing problem!

1.75GiB will fit in the page cache on any machine with >2GiB RAM.

One of the big problems is that people really don't know what to expect so they don't realise that their performance is orders of magnitude lower than it "should" be.

Part of this is because the numbers involved are really large: 1,879,048,192 bytes (1.75GiB) is an incomprehensibly large number. 2,600,000,000 times per second (2.6GHz) is an incomprehensibly large number of things that can be done in the blink of an eye.

...But if you divide them using simple units analysis; things divided by things per second gives you seconds: about 0.72. That's assuming that you can process 1 byte per clock cycle, which might be reasonable if the data is small and fits in cache. If we're going to be doing things in streaming mode then we'll be limited by memory bandwidth, not clock speed.

http://www.techspot.com/review/266-value-cpu-roundup/page5.h... reckons that the Intel Core 2 Duo E7500 (@ 2.93GHz) has 7,634MiB/s of memory bandwidth for reads.

That's 8,004,829,184 bytes per second.

Which means we should be able to squeeze our data through the processor in...

bytes divided by bytes per second = seconds =>

    >>> 1879048192.0 / 8004829184
    0.2347...

so roughly a quarter of a second at full read bandwidth.

We probably want to assume that there are other stalls and overheads, and that no real tool runs at raw memory bandwidth, so ending up an order of magnitude or two above that seems reasonable for this workload (he gets 12): the article says it's just a summary plus aggregate workload so we don't really need to allocate much in the way of arithmetic power.
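
The same units analysis, done mechanically (a quick sanity check using the figures quoted above):

```python
# time = data size / memory read bandwidth
data_bytes = 1_879_048_192    # 1.75 GiB of input
read_bw = 8_004_829_184       # bytes/s (the quoted 7,634 MiB/s read bandwidth)

seconds = data_bytes / read_bw   # bytes divided by bytes-per-second gives seconds
print(round(seconds, 4))         # ~0.2347 s at pure memory-bandwidth speed
```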

As with most things on x86, memory bandwidth is usually the bottleneck. If you're not getting numbers within an order of magnitude or so of memory bandwidth then either you have an arithmetic workload (and you know it) or you have a crap tool.

Due to the memory fetch patterns and latencies on x86, it's often possible to reorder your data access to get a nominal arithmetic workload close to the memory bandwidth expected speed.

His analysis about the parallelisation of the workload due to shell commands is incorrect. The speedup comes from accessing the stuff straight from the page cache.

His analysis about loading the data into memory on Hadoop is incorrect. The slowdown in Hadoop probably comes from the memory copying, allocation and GC involved in transforming the raw data from the page cache into objects in the language Hadoop is written in, and then throwing them away again. That's just a guess, because you want memory to fill up (to about 1.75GiB) so that you don't have to go to disk. That memory is held by the OS rather than the userland apps tho'.

His conclusion about how `sleep 3 | echo "Hello"` is done is incorrect. They're "done at the same time" because the shell starts every command in a pipeline concurrently, and echo never reads its stdin, so it doesn't wait on sleep at all. With a tool like uniq or sort, all the data has to be ingested before output can begin, because that's the nature of the algorithm. A tool like cat will give you line-by-line flow because it can, but the pipeline is strictly serial in nature and (as with uniq or sort) might stall in certain places.
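
This is easy to check directly: the text appears immediately, while the shell still waits for sleep to exit before the pipeline as a whole completes:

```shell
start=$(date +%s)
sleep 2 | echo "Hello"                  # "Hello" prints right away
end=$(date +%s)
echo "pipeline took $((end - start))s"  # ~2s: the shell waits for sleep to exit
```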

He claims that the processing is "non-IO-bound" but also encourages the user to clear the page cache. Clearing the page cache forces the workload to be IO bound by definition. The page cache is there to "hide" the IO bottlenecks where possible. If you're doing a few different calculations using a few different pipelines then you want the page cache to remain full as it will mean that the data doesn't have to be reread from disk for each pipeline invocation.

For example, when I ingest photos from my CF card, I use "cp" to get the data from the card to my local disk. The card is 8GiB. I have 16GiB of RAM. That cp usually reads ahead almost the whole card and then bottlenecks on the write part of getting it onto disk. That data then sits in RAM for as long as it can (until the memory is needed by something else) which is good because after the "cp" I invoke md5sum to calculate some checksums of all the files. This is CPU bound and runs way faster than it would if it was IO bound due to having to reread all that data from disk. (My arrangement is still suboptimal but this gives an example of how I can get advantages from the architecture without having to do early optimisations in my app: my ingest script is "fast enough" and I can just about afford to do the md5sum later because I can be almost certain it's going to use the exact same data that was read from the card rather than the copied data that is reread from disk and, theoretically, might read differently.)

He's firmly in the realm of "small data" by 4 or 5 base 10 orders of magnitude (at least) so he's nowhere close to getting a "scaling curve" that will tell him where best to optimise for the general case. When he starts getting to workloads 2 or 3 orders of magnitude bigger than what he's doing he might find that there are a certain class of optimisations that present themselves but that probably won't be the "big data" general case.

Having said that, this makes his approach entirely appropriate for the particular task at hand (which I think is his entire point).

Through his use of xargs he implies (but does not directly acknowledge) that he realises this is a so-called "embarrassingly parallel" problem. -----

[+] a3_nm|11 years ago|reply
I think it is unsafe to parallelize grep with xargs as is done in the article because, beyond shuffling the delivery order, the output of the parallel greps could get mixed up: the beginning of a line comes from one grep and the end of the line from a different grep, so, reading line by line afterwards, you get garbled lines.

See https://www.gnu.org/software/parallel/man.html#DIFFERENCES-B...
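
One hedged workaround if GNU parallel isn't available: give each concurrent grep its own temporary output file, then concatenate, so a line can never be split between two writers (a sketch; assumes file names without embedded newlines):

```shell
# one temp file per input keeps each grep's output intact
outdir=$(mktemp -d)
ls *.pgn | xargs -P 4 -I{} sh -c 'grep "Result" "$1" > "$2/$1.out"' _ {} "$outdir"
cat "$outdir"/*.out
```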

[+] pkrumins|11 years ago|reply
The example in the article with cat, grep and awk:

    cat *.pgn | \
    grep "Result" | \
    awk '
     {
        split($0, a, "-");
        res = substr(a[1], length(a[1]), 1);
        if (res == 1) white++;
        if (res == 0) black++;
        if (res == 2) draw++;
      }
      END { print white+black+draw, white, black, draw }
    '
Can be written much more succinctly with just awk, and you don't even need to split the string or use substr:

    awk '
      /Result/ {
        if (/1\/2/) draw++;
        else if (/1-0/) white++;
        else if (/0-1/) black++;
      }
      END { print white+black+draw, white, black, draw }
    ' *.pgn
[+] dice|11 years ago|reply
Keep reading, he removes the cat and grep in the final solution.
[+] zokier|11 years ago|reply
The author begins with a fairly idiomatic shell pipeline, but in the search for performance the pipeline transforms into an awk script. Not that I have anything against awk, but I feel like that kinda runs against the premise of the article. The article ends up demonstrating the power of awk over pipelines of small utilities.

Another interesting note is that there is a possibility that the script as-is could mis-parse the data. The grep should use '^\[Result' instead of 'Result'. I think this demonstrates nicely the fragility of these sorts of ad-hoc parsers that are common in shell pipelines.

[+] tracker1|11 years ago|reply
It probably depends on what you are trying to accomplish... I think a lot of us would reach for a scripting language to run through this (relatively small amount of data)... node.js does piped streams of input/output really well. And perl is the grand daddy of this type of input processing.

I wouldn't typically reach for a big data solution short of hundreds of gigs of data (which is borderline, but will only grow from there). I might even reach for something like ElasticSearch as an interim step, which will usually be enough.

If you can dedicate a VM in a cloud service to a single one-off task, that's probably a better option than creating a Hadoop cluster for most work loads.

[+] rkwasny|11 years ago|reply
Bottom line is - you do not need hadoop until you cross 2TB of data to be processed (uncompressed). Modern servers ( bare metal ones, not what AWS sells you ) are REALLY FAST and can crunch massive amounts of data.

Just use proper tools and well-optimized code written in C/C++/Go/etc - not a crappy Java framework-in-a-framework^N architecture that abstracts away thinking about CPU speed.

Bottom line, the popular saying is true: "Hadoop is about writing crappy code and then running it on a massive scale."

[+] earino|11 years ago|reply
Dell sells a server with 6TB of ram (I believe.) I think the limit is way over 2TB. If you want to be able to query it quickly for analytical workloads, MPPs like Vertica scale up to 150+TB (at Facebook.) I honestly don't know what the scale is where you need Hadoop, but it's gotten to be a large number very quickly.
[+] virmundi|11 years ago|reply
My question is what do you mean by 2TB? At my current client, we have 5 TBs of data sitting (that's relatively recent). Before we had 2-ish. However, we had over 30 applications doing complex fraud calculations on that. "Moving data" (data being read and then worked) is about 40 TB daily. Even with SSD and 256 GB of RAM, a single machine would get overwhelmed on this.

If you're only working one app on less than 1 TB, maybe you don't need something as complex as Hadoop. But given that a cluster is easy to setup (I made a really simple NameNode + Two Data nodes in 45 minutes, going cold), it might not be a bad idea.

I'll take this further and say that some tools for Hadoop that are not from Apache are really nice to work with even for non-Hadoop work. For example, I've got to join several 1 GB files together to go from a relational, CSV model into a Document store model. Can I do this with command line tools? Maybe. Cascading makes this really easy. Each file family is a tap. I get tuple joins naturally. I wrote an ArangoDB tap to auto load into ArangoDB. It was fun, testable and easy. All of this runs sans-hadoop on my little MBP.
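
For the flat-file half of that question, simple key joins can at least be sketched with the POSIX join utility (file names and columns here are hypothetical; both inputs must be sorted on the join key first):

```shell
# users.csv: id,name    orders.csv: id,total
sort -t, -k1,1 users.csv  > users.sorted
sort -t, -k1,1 orders.csv > orders.sorted
join -t, -1 1 -2 1 users.sorted orders.sorted   # emits id,name,total per matching id
```

Whether this beats a framework like Cascading depends on how many files and how irregular the keys are.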

Fun fact about the Cascading tool set is that I can take my little app from my desktop and plop it onto a Hadoop cluster with little change (taps from local to hadoop). Will I do that in my present example? No. Can I think of places where that's really useful? Yes, daily 35 fraud models' regression tests executed with each build. That's somewhere around 500 full model executions over limited, but meaningful data. All easily done courtesy of a framework that targets Hadoop.

[+] treve|11 years ago|reply
What makes 2TB the cutoff?
[+] notpeter|11 years ago|reply
This article echoes a talk Bryan Cantrill gave two years ago: https://youtu.be/S0mviKhVmBI

It's about how Joyent took the concept of a UNIX pipeline as a true powertool and built a distributed version atop an object filesystem with some little map/reduce syntactic sugar to replace Hadoop jobs with pipelines.

The Bryan Cantrill talk is definitely worth your time, but you can get an understanding of Manta with their 3m screencast: https://youtu.be/d2KQ2SQLQgg

[+] cheng1|11 years ago|reply
I have developed a one-liner toolset for Hadoop (for when I have to use it). It's refreshing to see a ZFS-based take on the concept. Don't like the JavaScript choice though.

GNU parallel should be a widely adopted choice. Lightweight. Fast. Low cost. Extendable.

[+] sam_lowry_|11 years ago|reply
Besides using `xargs -P 8 -n 1` to parallelize jobs locally, take a look at paexec, a GNU parallel replacement that just works.

See https://github.com/cheusov/paexec

[+] pmoriarty|11 years ago|reply
What's the advantage of using paexec over GNU parallel?
[+] mabbo|11 years ago|reply
I had an intern over the summer, working on a basic A/B Testing framework for our application (a very simple industrial handscanner tool used inside warehouses by a few thousand employees).

When we came to the last stage, analysis, he was keen to use MapReduce so we let him. In the end though, his analysis didn't work well, took ages to process when it did, and didn't provide the answers we needed. The code wasn't maintainable or reusable. shrug It happens. I had worse internships.

I put together some command line scripts to parse the files instead- grep, awk, sed, really basic stuff piped into each other and written to other files. They took 10 minutes or so to process, and provided reliable answers. The scripts were added as an appendix to the report I provided on the A/B test, and after formatting and explanations, took up a couple pages.

[+] arthurcolle|11 years ago|reply
I used Hadoop a few times this semester for different classes and the code seemed so easy to write: because everything is either a Mapper or a Reducer, you just read enough of the docs to figure out what is intended to be done and then build on top of it. Can I ask how it wasn't maintainable?

Just curious

[+] m_mueller|11 years ago|reply
On a tangent, I'd be interested in how you format heavily piped bash code for documentation. Can comments be interspersed there?
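
In bash, yes: a newline is allowed right after the `|` operator, and a comment may sit on that line, so each stage of a pipeline can carry its own annotation (a small illustration; games.pgn is a stand-in file name):

```shell
grep '^\[Result' games.pgn |   # keep only the result tag lines
sort |                         # group identical results together
uniq -c                        # count each distinct result
```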
[+] knodi123|11 years ago|reply
We have a proprietary algorithm for assigning foods a "suitability score" based on a user's personal health conditions and body data.

It used to be a fairly slow algorithm, so we ran it in a hadoop cluster and it cached the scores for every user vs. every food in a massive table on a distributed database.

Another developer, who is quite clever, rewrote our algorithm in C, and compiled it as a database function, which was about 100x faster. He also did some algebra work and found a way to change our calculations, yielding a measly 4-5x improvement.

It was so, so, so much faster that in one swoop we eliminated our entire Hadoop cluster and the massive scores table, and were actually able to sort your food search results by score, calculating scores on the fly.

[+] saym|11 years ago|reply
May I ask: Who is we?
[+] NyxWulf|11 years ago|reply
This also isn't a straight either-or proposition. I build local command-line pipelines and do testing and/or processing with them. When the amount of data passes into the range where memory or network bandwidth makes the processing more efficient on a Hadoop cluster, I make some fairly minimal conversions and run the same stream processing on the cluster in streaming mode. It hasn't been uncommon for my streaming jobs to be much faster than the same jobs run on the cluster with Hive or some other framework. Much of the speed boils down to the optimizer and the planner.

Overall I find it very efficient to use the same toolset locally and then scale it up to a cluster when and if I need to.

[+] azylman|11 years ago|reply
What toolset are you using that you can run both locally and on a Hadoop cluster?
[+] decisiveness|11 years ago|reply
If bash is the shell (assuming recursive search is required), maybe it would be even faster to just do:

    shopt -s globstar
    mawk '/Result/ {
        game++
        split($0, a, "-")
        res = substr(a[1], length(a[1]), 1)
        if(res == 1)
            white++
        if(res == 0)
            black++
        if(res == 2)
            draw++
    } END {
        print game, white, black, draw
    }' **/*.pgn

?
[+] taltman1|11 years ago|reply
This is a great exercise of how to take a Unix command line and iteratively optimize it with advanced use of awk.

In that spirit, one can optimize the xargs mawk invocation by 1) Getting rid of string-manipulation function calls (which are slow in awk), 2) using regular expressions in the pattern expression (which allows awk to short-circuit the evaluation of lines), and 3) avoiding use of field variables like $1, and $2, which allows the mawk virtual machine to avoid implicit field splitting. A bonus is that you end up with an awk script which is more idiomatic:

  mawk '
  /^\[Result "1\/2-1\/2"\]/ { draw++ }
  /^\[Result "1-0"\]/ { white++ }
  /^\[Result "0-1"\]/ { black++ }

  END { print white, black, draw }'  
Notice that I got rid of the printing out of the intermediate totals per file. Since we are only tabulating the final total, we can modify the 'reduce' mawk invocation to be as follows:

  mawk '
  {games += ($1+$2+$3); white += $1; black += $2; draw += $3}
  END { print games, white, black, draw }'
Making the bottle-neck data stream thinner always helps with overall throughput.
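
Putting the two invocations together, the whole map/reduce pipeline might look like this (a sketch; plain awk stands in for mawk here, and `+0` is added so unset counters print as 0 and the columns never shift):

```shell
# map: one parallel awk per file emits "white black draw"; reduce: sum the triples
ls *.pgn | xargs -n 1 -P 4 awk '
  /^\[Result "1\/2-1\/2"\]/ { draw++ }
  /^\[Result "1-0"\]/       { white++ }
  /^\[Result "0-1"\]/       { black++ }
  END { print white+0, black+0, draw+0 }' |
awk '{ games += $1 + $2 + $3; white += $1; black += $2; draw += $3 }
     END { print games, white, black, draw }'
```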
[+] philgoetz|11 years ago|reply
First, you don't score points with me for saying not to use Hadoop when you don't need to use Hadoop.

Second, you don't get to pretend you invented shell scripting because you came up with a new name for it.

Third, there are very few cases if any where writing a shell script is better than writing a Perl script.

[+] MrBuddyCasino|11 years ago|reply
To quote the memorable Ted Dziuba[0]:

"Here's a concrete example: suppose you have millions of web pages that you want to download and save to disk for later processing. How do you do it? The cool-kids answer is to write a distributed crawler in Clojure and run it on EC2, handing out jobs with a message queue like SQS or ZeroMQ.

The Taco Bell answer? xargs and wget. In the rare case that you saturate the network connection, add some split and rsync. A "distributed crawler" is really only like 10 lines of shell script."

[0] since his blog is gone: http://readwrite.com/2011/01/22/data-mining-and-taco-bell-pr...

[+] threeseed|11 years ago|reply
Oh right the "cool kids" approach.

Here's what the "sensible adults" think about when they see problems like this. Operational supportability: how do you monitor the operation? Restart recovery: can you restart the operation midway through if something fails? Maintainability: can we run the same application on our desktop as on our production servers? Extensibility: can we extend the platform easily to do X, Y, Z after the crawling?

I can't stand developers who come up with the xargs/wget approach, hack something together and then walk away from it. I've seen it far too often and it's great for the short term. Dreadful for the long term.

[+] lloeki|11 years ago|reply
Alternative, real life scenario: navigate through 6 months of daily MySQL dumps, assorted YAML files and Rails production.log, looking for some cross product between tables, requests and serialised entities, for analysis and/or data recovery (pinpoint or retrieval).

zcat/cut/sed/grep/awk/perl crawled through it in a couple of minutes and required less than half an hour to craft a reliable enough implementation (including looking up relations from foreign keys straight from the SQL dumps).

My colleagues, who still don't get the point of a command line, would still be restoring each dump individually and running SQL requests manually to this day (or more probably declare it "too complex" and mark it as a loss). Side note: I'm torn between leaving this place where nobody seems to understand the point of anything remotely like engineering or keeping this job where I'm obviously being extremely useful to our customers.

[+] 101914|11 years ago|reply
Are these "web pages" all on the same website?

If so, using wget is a poor solution. I have not used wget in over a decade but as I recall it does not do HTTP pipelining; I could be wrong on that - please correct me.

I do recall with certainty that when wget was first written and disseminated in the 1990's, "webmasters" wanted to ban it. httpd's were not as resilient then as they are today, nor was bandwidth and hardware as inexpensive.

HTTP pipelining is a smarter alternative than burdening the remote host with thousands of consecutive or simultaneous connections.

Depending on the remote host's httpd settings, HTTP pipelining usually lets you make 100 or maybe more requests using a single connection. It can be accomplished with only a simple tcpclient like the original nc and the shell.
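
A rough sketch of what that nc-and-shell pipelining could look like (example.com is a stand-in host; real use needs response parsing, keep-alive limits, and error handling, so the nc line is left commented out here):

```shell
# write several HTTP/1.1 requests down one TCP connection
pipelined_requests() {
  printf 'GET /page1 HTTP/1.1\r\nHost: example.com\r\n\r\n'
  printf 'GET /page2 HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n'
}
# pipelined_requests | nc example.com 80   # actually send them
pipelined_requests | grep -c '^GET'        # sanity check: prints 2
```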

In any event, the line about a "distributed crawler" is spot on. Never underestimate the power of marketing to suspend common sense.

Also, I find that I can often speed my scripts up a little by using exec in shell pipelines, e.g., util1 |exec util2 or exec util1 |exec util2.

There are other, better approaches besides using the builtin exec, but I will leave those for another day.

[+] juliangregorian|11 years ago|reply
Having written distributed crawlers, saturating the network connection is quite easy to do and is the main reason for even distributing that type of work in the first place.
[+] make3|11 years ago|reply
You know, or the real-world, reasonable, mature engineering answer: a Java/C#/C++ scalable parallel tool using modern libraries, and MPI if it ever needs to scale.
[+] kylek|11 years ago|reply
I feel ag (the silver searcher, a grep-ish alternative) should be mentioned (even though he dropped it in his final awk/mawk commands) as it tends to be much faster than grep, and considering he cites performance throughout.
[+] ggreer|11 years ago|reply
GitHub link for those who don't know about it: https://github.com/ggreer/the_silver_searcher/

I built ag for searching code. It can be (ab)used for other stuff, but the defaults are optimized for a developer searching a codebase. Also, when writing ag, I don't go out of my way to make sure behavior is correct on all platforms in all corner cases. Grep, on the other hand, has been around for decades. It probably handles cases I've never even thought of.