I'm becoming a stronger and stronger advocate of teaching command-line interfaces even to novice programmers...it's easier in many ways to think of data being worked on by "filters" and "pipes"...and more importantly, every time you try a step, something happens, making it much easier to iterate through a process interactively.
That it also happens to be very fast and powerful (when memory isn't a limiting factor) is nice icing on the cake. I moved over to doing much more on the CLI after realizing that something as simple as "head -n 1 massive.csv" to inspect the headers of corrupt multi-GB CSV files made my data-munging life substantially more enjoyable than opening them up in Sublime Text.
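That peek, sketched end to end (this massive.csv is a tiny stand-in generated on the spot, not a real multi-GB file):

```shell
# Generate a stand-in CSV so the commands below are runnable as-is
printf 'id,name,score\n1,alice,10\n2,bob,20\n' > massive.csv

head -n 1 massive.csv                 # header row only; never loads the file
head -n 1 massive.csv | tr ',' '\n'   # one column name per line
tail -n +2 massive.csv | wc -l        # row count, header excluded
```

head reads only the first block of the file, so this stays instant no matter how large the CSV gets.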
A few years ago, between projects, my coworkers cooked up a satirical set of amazing Web 2.0 data science tools. They used git, did a screencast, and distributed it internally.
It was basically a few compiled perl scripts and some obfuscated shell scripts with a layer of glitz. People actually used it and LOVED it... It was supposedly better than the real tools some groups were using.
It was one of the more epic work trolls I've ever seen!
Your CSV peeking epiphany was in essence a matter of code vs. tools though rather than necessarily CLI vs. GUI. On Windows you might just as well have discovered you could fire up Linqpad and enter File.ReadLines("massive.csv").First() for example.
Perhaps I'm missing something. It appears that the author is recommending against using Hadoop (and related tools) for processing 3.5GB of data. Who in the world thought that would be a good idea to begin with?
The underlying problem here isn't unique to Hadoop. People who are minimally familiar with how technology works and who are very much into BuzzWords™ will always throw around the wrong tool for the job so they can sound intelligent with a certain segment of the population.
That said, I like seeing how people put together their own CLI-based processing pipelines.
I've used hadoop at petabyte scale (2+pb input; 10+pb sorted for the job) for machine learning tasks. If you have such a thing on your resume, you will be inundated with employers who have "big data", and at least half will be under 50g with a good chunk of those under 10g. You'll also see multiple (shitty) 16 machine clusters, any of which -- for any task -- could be destroyed by code running on a single decent server with ssds. Let alone hadoop jobs running in emr, which is glacially slow (slow disk, slow network, slow everything.)
Also, Hadoop is so painfully slow to develop in that it's practically a full employment act for software engineers. I imagine it's similar to early EJB coding.
Exactly this just happened where I work. The CIO was recommending Hadoop on AWS for our image processing/analysis jobs. We process a single set of images at a time, which comes in at around 1.5GB. The output data size is about 1.2GB. Not a good candidate for Hadoop but, you know... "big data", right?
"Although Tom was doing the project for fun, often people use Hadoop and other so-called Big Data (tm) tools for real-world processing and analysis jobs that can be done faster with simpler tools and different techniques."
I think the point the author is making is that although they knew from the start that Hadoop wasn't necessary for the job, many people probably don't.
Lots of people think that is "big data". For most people, if it's too big for an Excel spreadsheet, it's "big data", and the way you process big data is with Hadoop. Of course, once you show them the difference in billable hours between setting up a Hadoop cluster and (in my case at least) using Python libraries on a MBP, they change their minds real fast. It's just a matter of "big data" being a new thing; people will figure it out as time goes on and things settle down.
People love the idea of being big and having "big problems". Wanting to use Hadoop isn't that different from wanting to use all sorts of fancy "web-scale" databases.
Most of us don't have scaling issues or big data, but that sort of excludes us from using all the fancy new tools that we want to play with. I'm still convinced that most of the stuff I work on could be run on SQLite, if designed a bit more carefully.
The truth is that most of us will never do anything that couldn't be solved with 10-year-old technology. And honestly we should be happy; there's a certain comfort in being able to use simple and generally understood tools.
Some have questioned why I would spend the time advocating against the use of Hadoop for such small data processing tasks, as that's clearly not where it should be used anyway. Sadly, Big Data (tm) frameworks are often recommended, required, or used more often than they should be. I know to many of us it seems crazy, but it's true. The worst I've seen was Hadoop used for a processing task of less than 1MB. Seriously.
Also, much agreement with those saying there should be more education effort when it comes to teaching command line tools. O'Reilly even has a book out on the topic: http://shop.oreilly.com/product/0636920032823.do
Thank you for all the comments and support.
Author of Data Science at the Command Line here. Thanks for the nice blog post and for mentioning my book here. While we're talking about the subject of education, allow me to shamelessly promote a two-day workshop that I'll be giving next month in London: http://datascienceatthecommandline.com/#workshop
This is a great article and a fun read. A friend sent it over to me and I wrote him some notes about my thoughts. Having since realised it's on HN, I thought I'd post them here as well.
Some of my wording is a bit terse; sorry! :-) The article is great and I really enjoyed it. He's certainly got the right solution for the particular task at hand (which I think is his entire point) but he's generally right for the wrong reasons so I pick a few holes in that: I'm not trying to be mean. :-)
-----
Classic system sizing problem!
1.75GiB will fit in the page cache on any machine with >2GiB RAM.
One of the big problems is that people really don't know what to expect
so they don't realise that their performance is orders of magnitude
lower than it "should" be.
Part of this is because the numbers involved are really large:
1,879,048,192 bytes (1.75GiB) is an incomprehensibly large number.
2,600,000,000 times per second (2.6GHz) is an incomprehensibly large
number of things that can be done in the blink of an eye.
...But if you divide them using simple units analysis (things divided by things per second gives you seconds), you get about 0.72 seconds. That's assuming you can process 1 byte per clock cycle, which might be reasonable if the data is small and fits in cache. If we're going to be doing things in streaming mode then we'll be limited by memory bandwidth, not clock speed.
http://www.techspot.com/review/266-value-cpu-roundup/page5.h... reckons that the Intel Core 2 Duo E7500 (@ 2.93GHz) has 7,634MiB/s of memory bandwidth for reads. That's 8,004,829,184 bytes per second.
Which means we should be able to squeeze our data through the processor in...
bytes divided by bytes per second = seconds =>
>>> 1879048192.0 / 8004829184
0.23473932...
so well under a second per pass.
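Those two bounds come straight out of dividing the data size by the processing rate; as a quick sanity check:

```shell
# Streaming-time bounds for 1,879,048,192 bytes (a sketch of the
# arithmetic above; the rates are the ones quoted in the comment)
awk 'BEGIN {
  bytes = 1879048192
  printf "1 byte/cycle at 2.6GHz:  %.2f s\n", bytes / 2600000000
  printf "7,634MiB/s memory reads: %.2f s\n", bytes / 8004829184
}'
```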
We probably want to assume that there are other stalls and overheads,
but anywhere from a few seconds up to a minute seems reasonable for
that workload (he gets 12): the article says it's just a summary plus
aggregate workload so we don't really need to allocate much in the way
of arithmetic power.
As with most things in x86, memory bandwidth is usually the bottleneck.
If you're not getting numbers within an order of magnitude or so of memory
bandwidth then either you have an arithmetic workload (and you know it)
or you have a crap tool.
Due to the memory fetch patterns and latencies on x86, it's often
possible to reorder your data access to get a nominal arithmetic
workload close to the memory bandwidth expected speed.
His analysis about the parallelisation of the workload due to shell
commands is incorrect. The speedup comes from accessing the stuff
straight from the page cache.
His analysis about loading the data into memory on Hadoop is incorrect.
The slowdown in Hadoop probably comes from memory copying, allocation
and GC involved in transforming the raw data from the page cache into
objects in the language that Hadoop is written in and then throwing them
away again. That's just a guess because you want memory to fill up (to
about 1.75GiB) so that you don't have to go to disk. That memory is held
by the OS rather than the userland apps tho'.
His conclusion about how `sleep 3 | echo "Hello"` is done is incorrect.
They're "done at the same time" because the shell starts every stage of
a pipeline concurrently and echo never reads its stdin, so it prints
immediately rather than after the three seconds. With a tool like uniq or
sort it has to ingest all the data before it can begin because that's
the nature of the algorithm. A tool like cat will give you line-by-line
flow because it can but the pipeline is strictly serial in nature and
(as with uniq or sort), might stall in certain places.
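A small demonstration of both behaviours (timings are approximate):

```shell
# Every stage of a pipeline starts at once; the right-hand stage only
# blocks if it actually reads from the pipe.
start=$(date +%s)
sleep 3 | date +%s > stamp        # date runs immediately, sleep runs on
echo "right-hand stage lag: $(( $(cat stamp) - start ))s"   # prints 0s

# A consumer that needs all its input before it can emit (sort, uniq)
# waits the full 3 seconds for EOF even though nothing is ever written:
sleep 3 | sort
```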
He claims that the processing is "non-IO-bound" but also encourages the
user to clear the page cache. Clearing the page cache forces the
workload to be IO bound by definition. The page cache is there to "hide"
the IO bottlenecks where possible. If you're doing a few different
calculations using a few different pipelines then you want the page
cache to remain full as it will mean that the data doesn't have to be
reread from disk for each pipeline invocation.
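One way to see this on Linux (the cache-dropping line needs root, so it's left commented; scratch.bin is a throwaway file):

```shell
# Write a 256MiB scratch file, then time two identical reads. Because
# we just wrote it, it is already in the page cache; uncomment the
# drop_caches line (as root) to force the first read to hit the disk.
dd if=/dev/zero of=scratch.bin bs=1M count=256 2>/dev/null
sync
# echo 3 > /proc/sys/vm/drop_caches    # root only: empty the page cache
time dd if=scratch.bin of=/dev/null bs=1M 2>/dev/null   # cold if dropped
time dd if=scratch.bin of=/dev/null bs=1M 2>/dev/null   # warm: from RAM
```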
For example, when I ingest photos from my CF card, I use "cp" to get the
data from the card to my local disk. The card is 8GiB. I have 16GiB of
RAM. That cp usually reads ahead almost the whole card and then
bottlenecks on the write part of getting it onto disk. That data then
sits in RAM for as long as it can (until the memory is needed by
something else) which is good because after the "cp" I invoke md5sum to
calculate some checksums of all the files. This is CPU bound and runs
way faster than it would if it was IO bound due to having to reread all
that data from disk. (My arrangement is still suboptimal but this gives
an example of how I can get advantages from the architecture without
having to do early optimisations in my app: my ingest script is "fast
enough" and I can just about afford to do the md5sum later because I can
be almost certain it's going to use the exact same data that was read
from the card rather than the copied data that is reread from disk and,
theoretically, might read differently.)
He's firmly in the realm of "small data" by 4 or 5 base 10 orders of
magnitude (at least) so he's nowhere close to getting a "scaling curve"
that will tell him where best to optimise for the general case. When he
starts getting to workloads 2 or 3 orders of magnitude bigger than what
he's doing he might find that there are a certain class of optimisations
that present themselves but that probably won't be the "big data"
general case.
Having said that, this makes his approach entirely appropriate for the particular task at hand (which I think is his entire point).
Through his use of xargs he implies (but does not directly acknowledge)
that he realises this is a so-called "embarrassingly parallel" problem.
-----
I think it is unsafe to parallelize grep with xargs as is done in the article because, beyond shuffling the delivery order, the output of the parallel greps could get mixed up (the beginning of a line comes from one grep and the end from another, so, reading line by line afterwards, you get garbled lines).
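One defensive pattern is to give each parallel grep a private output file so no two writers ever share a pipe (a sketch; the .pgn files here are contrived):

```shell
# Set up two small input files
mkdir -p pgns
printf '[Result "1-0"]\n' > pgns/a.pgn
printf '[Result "0-1"]\nsome moves\n' > pgns/b.pgn

# Each grep writes to its own file; only the final cat is serial
find pgns -name '*.pgn' -print0 |
  xargs -0 -P4 -I{} sh -c 'grep "Result" "$1" > "$1.matches"' _ {}
cat pgns/*.matches
```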
The author begins with a fairly idiomatic shell pipeline, but in the search for performance it transforms into an awk script. Not that I have anything against awk, but I feel like that kinda runs against the premise of the article. The article ends up demonstrating the power of awk over pipelines of small utilities.
Another interesting note is that there is a possibility that the script as-is could mis-parse the data. The grep should use '^\[Result' instead of 'Result'. I think this demonstrates nicely the fragility of these sorts of ad-hoc parsers that are common in shell pipelines.
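A quick illustration of the false positive (sample.pgn is contrived; the second line mimics a PGN comment):

```shell
printf '[Result "1-0"]\n{ The Result was disputed }\n' > sample.pgn
grep -c 'Result'    sample.pgn    # 2: the annotation matches too
grep -c '^\[Result' sample.pgn    # 1: only the actual tag line
```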
It probably depends on what you are trying to accomplish... I think a lot of us would reach for a scripting language to run through this (relatively small amount of data)... node.js does piped streams of input/output really well. And perl is the grand daddy of this type of input processing.
I wouldn't typically reach for a big data solution short of hundreds of gigs of data (which is borderline, but will only grow from there). I might even reach for something like ElasticSearch as an interim step, which will usually be enough.
If you can dedicate a VM in a cloud service to a single one-off task, that's probably a better option than creating a Hadoop cluster for most workloads.
Bottom line: you do not need Hadoop until you cross 2TB of data to be processed (uncompressed).
Modern servers (bare metal ones, not what AWS sells you) are REALLY FAST and can crunch massive amounts of data.
Just use proper tools: well-optimized code written in C/C++/Go/etc., not a crappy Java framework-in-a-framework^N architecture that abstracts away thinking about CPU speed.
Bottom line, the popular saying is true:
"Hadoop is about writing crappy code and then running it on a massive scale."
Dell sells a server with 6TB of RAM (I believe). I think the limit is way over 2TB. If you want to be able to query it quickly for analytical workloads, MPPs like Vertica scale up to 150+TB (at Facebook). I honestly don't know the scale at which you need Hadoop, but it's gotten to be a very large number very quickly.
My question is what do you mean by 2TB? At my current client, we have 5 TBs of data sitting (that's relatively recent). Before we had 2-ish. However, we had over 30 applications doing complex fraud calculations on that. "Moving data" (data being read and then worked) is about 40 TB daily. Even with SSD and 256 GB of RAM, a single machine would get overwhelmed on this.
If you're only working one app on less than 1 TB, maybe you don't need something as complex as Hadoop. But given that a cluster is easy to set up (I made a really simple NameNode + two DataNodes in 45 minutes, going in cold), it might not be a bad idea.
I'll take this further and say that some tools for Hadoop that are not from Apache are really nice to work with, even for non-Hadoop work. For example, I had to join several 1 GB files together to go from a relational, CSV model into a document store model. Can I do this with command line tools? Maybe. Cascading makes this really easy. Each file family is a tap. I get tuple joins naturally. I wrote an ArangoDB tap to auto-load into ArangoDB. It was fun, testable and easy. All of this runs sans-Hadoop on my little MBP.
Fun fact about the Cascading tool set is that I can take my little app from my desktop and plop it onto a Hadoop cluster with little change (taps from local to hadoop). Will I do that in my present example? No. Can I think of places where that's really useful? Yes, daily 35 fraud models' regression tests executed with each build. That's somewhere around 500 full model executions over limited, but meaningful data. All easily done courtesy of a framework that targets Hadoop.
It's about how Joyent took the concept of a UNIX pipeline as a true powertool and built a distributed version atop an object filesystem with some little map/reduce syntactic sugar to replace Hadoop jobs with pipelines.
The Bryan Cantrill talk is definitely worth your time, but you can get an understanding of Manta with their 3m screencast: https://youtu.be/d2KQ2SQLQgg
I have developed a one-liner toolset for Hadoop (for when I have to use it). It's refreshing to see a ZFS-based take on the concept. Don't like the JavaScript choice though.
GNU parallel should be a widely adopted choice. Lightweight. Fast. Low cost. Extensible.
I had an intern over the summer, working on a basic A/B Testing framework for our application (a very simple industrial handscanner tool used inside warehouses by a few thousand employees).
When we came to the last stage, analysis, he was keen to use MapReduce so we let him. In the end though, his analysis didn't work well, took ages to process when it did, and didn't provide the answers we needed. The code wasn't maintainable or reusable. shrug It happens. I had worse internships.
I put together some command line scripts to parse the files instead- grep, awk, sed, really basic stuff piped into each other and written to other files. They took 10 minutes or so to process, and provided reliable answers. The scripts were added as an appendix to the report I provided on the A/B test, and after formatting and explanations, took up a couple pages.
I used Hadoop a few times this semester for different classes, and the code seemed so easy to write: because everything is either a Mapper or a Reducer, you just read enough of the docs to figure out what's intended and build on top of it. Can I ask how it wasn't maintainable?
We have a proprietary algorithm for assigning foods a "suitability score" based on a user's personal health conditions and body data.
It used to be a fairly slow algorithm, so we ran it in a hadoop cluster and it cached the scores for every user vs. every food in a massive table on a distributed database.
Another developer, who is quite clever, rewrote our algorithm in C, and compiled it as a database function, which was about 100x faster. He also did some algebra work and found a way to change our calculations, yielding a measly 4-5x improvement.
It was so, so, so much faster that in one swoop we eliminated our entire Hadoop cluster, and the massive scores table, and were actually able to sort your food search results by score, calculating scores on the fly.
This also isn't a straight either-or proposition. I build local command line pipelines and do testing and/or processing. When the amount of data to be processed passes into the range where memory or network bandwidth makes the processing more efficient on a Hadoop cluster, I make some fairly minimal conversions and run the stream processing on the Hadoop cluster in streaming mode. It hasn't been uncommon for my jobs to be much faster than the same jobs run on the cluster with Hive or some other framework. Much of the speed comes down to the optimizer and the planner.
Overall I find it very efficient to use the same toolset locally and then scale it up to a cluster when and if I need to.
This is a great exercise of how to take a Unix command line and iteratively optimize it with advanced use of awk.
In that spirit, one can optimize the xargs mawk invocation by 1) Getting rid of string-manipulation function calls (which are slow in awk), 2) using regular expressions in the pattern expression (which allows awk to short-circuit the evaluation of lines), and 3) avoiding use of field variables like $1, and $2, which allows the mawk virtual machine to avoid implicit field splitting. A bonus is that you end up with an awk script which is more idiomatic:
Notice that I got rid of the printing out of the intermediate totals per file. Since we are only tabulating the final total, we can modify the 'reduce' mawk invocation to be as follows:
mawk '
{games += ($1+$2+$3); white += $1; black += $2; draw += $3}
END { print games, white, black, draw }'
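The 'map' half of that pipeline didn't survive the formatting here, so the following is a hypothetical reconstruction that follows the three rules above (anchored regex patterns only, no field variables, no string-manipulation functions); plain awk and mawk both accept it:

```shell
# Tally chess results per input chunk; emits "white black draw" for the
# reduce step above to sum
printf '[Result "1-0"]\n[Result "0-1"]\n[Result "1/2-1/2"]\n' |
awk '
  /^\[Result "1-0"/ { white++ }
  /^\[Result "0-1"/ { black++ }
  /^\[Result "1\/2/ { draw++  }
  END { print white+0, black+0, draw+0 }'
```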
Making the bottle-neck data stream thinner always helps with overall throughput.
"Here's a concrete example: suppose you have millions of web pages that you want to download and save to disk for later processing. How do you do it? The cool-kids answer is to write a distributed crawler in Clojure and run it on EC2, handing out jobs with a message queue like SQS or ZeroMQ.
The Taco Bell answer? xargs and wget. In the rare case that you saturate the network connection, add some split and rsync. A "distributed crawler" is really only like 10 lines of shell script."
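The shape of that crawler, with echo standing in for wget so the sketch runs offline (urls.txt and the URLs are made up):

```shell
printf 'http://a.example/1\nhttp://b.example/2\n' > urls.txt
xargs -n1 -P2 echo GET < urls.txt     # two parallel workers, one URL each
# the real thing:   xargs -n1 -P8 wget -q -P pages/ < urls.txt
# saturated uplink: split urls.txt into shards and rsync them to workers
```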
Here's what the "sensible adults" think about when they see problems like this. Operational supportability: how do you monitor the operation? Restart recovery: can you restart the operation midway through if something fails? Maintainability: can we run the same application on our desktop as on our production servers? Extensibility: can we extend the platform easily to do X, Y, Z after the crawling?
I can't stand developers who come up with the xargs/wget approach, hack something together and then walk away from it. I've seen it far too often and it's great for the short term. Dreadful for the long term.
Alternative, real life scenario: navigate through 6 months of daily MySQL dumps, assorted YAML files and Rails production.log, looking for some cross product between tables, requests and serialised entities, for analysis and/or data recovery (pinpoint or retrieval).
zcat/cut/sed/grep/awk/perl crawled through it in a couple of minutes and required less than half an hour to craft a reliable enough implementation (including looking up relations from foreign keys straight from the SQL dumps).
My colleagues, who still don't get the point of a command line, would still be restoring each dump individually and running SQL requests manually to this day (or more probably declare it "too complex" and mark it as a loss). Side note: I'm torn between leaving this place where nobody seems to understand the point of anything remotely like engineering or keeping this job where I'm obviously being extremely useful to our customers.
If so, using wget is a poor solution. I have not used wget in over a decade but as I recall it does not do HTTP pipelining; I could be wrong on that - please correct me.
I do recall with certainty that when wget was first written and disseminated in the 1990's, "webmasters" wanted to ban it. httpd's were not as resilient then as they are today, nor was bandwidth and hardware as inexpensive.
HTTP pipelining is a smarter alternative than burdening the remote host with thousands of consecutive or simultaneous connections.
Depending on the remote host's httpd settings, HTTP pipelining usually lets you make 100 or maybe more requests using a single connection. It can be accomplished with only a simple tcpclient like the original nc and the shell.
In any event, the line about a "distributed crawler" is spot on. Never underestimate the power of marketing to suspend common sense.
Also, I find that I can often speed my scripts up a little by using exec in shell pipelines, e.g., util1 |exec util2 or exec util1 |exec util2.
There are other, better approaches besides using the builtin exec, but I will leave those for another day.
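For the curious, the effect of that exec is to replace the subshell the shell forks for a pipeline stage, rather than forking once more from it; one process is saved per stage, which is rarely dramatic:

```shell
# wc takes over the subshell's process instead of running as its child
seq 1000000 | exec wc -l
```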
You could just use Scrapy [1]. Easy to set up, and plenty of options you can activate if needed. Likely more robust than shell scripts as well. No Hadoop involved.
Having written distributed crawlers, saturating the network connection is quite easy to do and is the main reason for even distributing that type of work in the first place.
You know, or the real-world, reasonable, mature engineering answer: a Java/C#/C++ scalable parallel tool using modern libraries, and MPI if it ever needs to scale.
I feel ag (the silver searcher, a grep-ish alternative) should be mentioned (even though he dropped it in his final awk/mawk commands), as it tends to be much faster than grep, and considering he cites performance throughout.
I built ag for searching code. It can be (ab)used for other stuff, but the defaults are optimized for a developer searching a codebase. Also, when writing ag, I don't go out of my way to make sure behavior is correct on all platforms in all corner cases. Grep, on the other hand, has been around for decades. It probably handles cases I've never even thought of.
[+] [-] danso|11 years ago|reply
That it also happens to very fast and powerful (when memory isn't a limiting factor) is nice icing on the cake. I moved over to doing much more on CLI after realizing that doing something as simple as "head -n 1 massive.csv" to inspect headers of corrupt multi-gb CSV files made my data-munging life substantially more enjoyable than opening them up in Sublime Text.
[+] [-] Spooky23|11 years ago|reply
It was basically a few compiled perl scripts and some obfuscated shell scripts with a layer of glitz. People actually used it and LOVED it... It was supposedly better than the real tools some groups were using.
It was one of the more epic work trolls I've ever seen!
[+] [-] hanoz|11 years ago|reply
[+] [-] crcsmnky|11 years ago|reply
The underlying problem here isn't unique to Hadoop. People who are minimally familiar with how technology works and who are very much into BuzzWords™ will always throw around the wrong tool for the job so they can sound intelligent with a certain segment of the population.
That said, I like seeing how people put together their own CLI-based processing pipelines.
[+] [-] x0x0|11 years ago|reply
Also, hadoop is so painfully slow to develop in it's practically a full employment act for software engineers. I imagine it's similar to early ejb coding.
[+] [-] EpicEng|11 years ago|reply
[+] [-] theVirginian|11 years ago|reply
I think the point the author is making is that although they knew from the start that Hadoop wasn't necessary for the job, many people probably don't.
[+] [-] sputknick|11 years ago|reply
[+] [-] mrweasel|11 years ago|reply
Most of us don't have scaling issues or big data, but that sort of excludes us from using all the fancy new tools that we want to play with. I'm still convinced that most of the stuff I work on at work could be run on SQLite, with designed a bit more careful.
The truth is that most of us will never do anything that couldn't be solved with 10 year old technology. And honestly we should happy, there's a certain comfort in being able to use simple and generally understood tools.
[+] [-] aadrake|11 years ago|reply
Some have questioned why I would spend the time advocating against the use of Hadoop for such small data processing tasks as that's clearly not when it should be used anyway. Sadly, Big Data (tm) frameworks are often recommended, required, or used more often than they should be. I know to many of us it seems crazy, but it's true. The worst I've seen was Hadoop used for a processing task of less than 1MB. Seriously.
Also, much agreement with those saying there should be more education effort when it comes to teaching command line tools. O'Reilly even has a book out on the topic: http://shop.oreilly.com/product/0636920032823.do
Thank you for all the comments and support.
[+] [-] jeroenjanssens|11 years ago|reply
[+] [-] andyjpb|11 years ago|reply
Some of my wording is a bit terse; sorry! :-) The article is great and I really enjoyed it. He's certainly got the right solution for the particular task at hand (which I think is his entire point) but he's generally right for the wrong reasons so I pick a few holes in that: I'm not trying to be mean. :-)
----- Classic system sizing problem!
1.75GiB will fit in the page cache on any machine with >2GiB RAM.
One of the big problems is that people really don't know what to expect so they don't realise that their performance is orders of magnitude lower than it "should" be.
Part of this is because the numbers involved are really large: 1,879,048,192 bytes (1.7GiB) is an incomprehensibly large number. 2,600,000,000 times per second (2.6GHz) is an incomprehensibly large number of things that can be done in the blink of an eye.
...But if you divide them using simple units analysis; things per second divided by things gives you seconds: 1.383. That's assuming that you can process 1 byte per clock cycle which might be reasonable if the data is small and fits in cache. If we're going to be doing things in streaming mode then we'll be limited by memory bandwidth, not clock speed.
http://www.techspot.com/review/266-value-cpu-roundup/page5.h... reckons that the Intel Core 2 Duo E7500 @ 2.93GHz) has 7,634MiB/s of memory bandwidth for reads.
That's 8,004,829,184 bytes per second.
Which means we should be able to squeeze our data through the processor in...
bytes per second divided by bytes = seconds =>
>>> 8004829184 / 1879048192.0 4.260044642857143
so less than 5 seconds.
We probably want to assume that there are other stalls and overheads, but a number between 20 and 60 seconds seems reasonable for that workload (he gets 12): the article says it's just a summary plus aggregate workload so we don't really need to allocate much in the way of arithmetic power.
As with most things in x86, memory bandwidth is usually the bottleneck. If you're not getting numbers with an order of magnitude or so of memory bandwidth then either you have a arithmetic workload (and you know it) or you have a crap tool.
Due to the memory fetch patterns and latencies on x86, it's often possible to reorder your data access to get a nominal arithmetic workload close to the memory bandwidth expected speed.
His analysis about the parallelisation of the workload due to shell commands is incorrect. The speedup comes from accessing the stuff straight from the page cache.
His analysis about loading the data into memory on Hadoop is incorrect. The slowdown in Hadoop probably comes from memory copying, allocation and GC involved in transforming the raw data from the page cache into object in the language that Hadoop is written in and then throwing them away again. That's just a guess because you want memory to fill up (to about 1.75GiB) so that you don't have to go to disk. That memory is held by the OS rather than the userland apps tho'.
His conclusion about how `sleep 3 | echo "Hello"` is done is incorrect. They're "done at the same time" because sleep closes stdout immediately rather than at the end of the three seconds. With a tool like uniq or sort it has to ingest all the data before it can begin because that's the nature of the algorithm. A tool like cat will give you line-by-line flow because it can but the pipeline is strictly serial in nature and (as with uniq or sort), might stall in certain places.
He claims that the processing is "non-IO-bound" but also encourages the user to clear the page cache. Clearing the page cache forces the workload to be IO bound by definition. The page cache is there to "hide" the IO bottlenecks where possible. If you're doing a few different calculations using a few different pipelines then you want the page cache to remain full as it will mean that the data doesn't have to be reread from disk for each pipeline invocation.
For example, when I ingest photos from my CF card, I use "cp" to get the data from the card to my local disk. The card is 8GiB. I have 16GiB of RAM. That cp usually reads ahead almost the whole card and then bottlenecks on the write part of getting it onto disk. That data then sits in RAM for as long as it can (until the memory is needed by something else) which is good because after the "cp" I invoke md5sum to calculate some checksums of all the files. This is CPU bound and runs way faster than it would if it was IO bound due to having to reread all that data from disk. (My arrangement is still suboptimal but this gives an example of how I can get advantages from the architecture without having to do early optimisations in my app: my ingest script is "fast enough" and I can just about afford to do the md5sum later because I can be almost certain it's going to use the exact same data that was read from the card rather than the copied data that is reread from disk and, theoretically, might read differently.)
He's firmly in the realm of "small data" by 4 or 5 base 10 orders of magnitude (at least) so he's nowhere close to getting a "scaling curve" that will tell him where best to optimise for the general case. When he starts getting to workloads 2 or 3 orders of magnitude bigger than what he's doing he might find that there are a certain class of optimisations that present themselves but that probably won't be the "big data" general case.
Having said that, this makes his approach entirely appropriate for the particular task at hand (which I think it his entire point).
Through his use of xargs he implies (but does not directly acknowledge) that he realises this is a so-called "embarrassingly parallel" problem. -----
a3_nm | 11 years ago
See https://www.gnu.org/software/parallel/man.html#DIFFERENCES-B...
pkrumins | 11 years ago
dice | 11 years ago
zokier | 11 years ago
Another interesting note is that there is a possibility that the script as-is could mis-parse the data. The grep should use '^\[Result' instead of 'Result'. I think this demonstrates nicely the fragility of these sorts of ad-hoc parsers that are common in shell pipelines.
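A quick illustration of the difference, using a hypothetical PGN fragment:

```shell
# 'Result' can appear outside the result tag, e.g. in an Event header.
printf '[Event "Result of playoff"]\n[Result "1-0"]\n' > game.pgn

grep -c 'Result'    game.pgn   # 2: the Event line matches too
grep -c '^\[Result' game.pgn   # 1: only the anchored result tag
```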
tracker1 | 11 years ago
I wouldn't typically reach for a big data solution short of hundreds of gigs of data (which is borderline, but will only grow from there). I might even reach for something like ElasticSearch as an interim step, which will usually be enough.
If you can dedicate a VM in a cloud service to a single one-off task, that's probably a better option than creating a Hadoop cluster for most workloads.
rkwasny | 11 years ago
Just use proper tools: well-optimized code written in C/C++/Go/etc., not the crappy Java framework-in-a-framework^N architecture that abstracts away thinking about CPU speed.
Bottom line, the popular saying is true: "Hadoop is about writing crappy code and then running it on a massive scale."
earino | 11 years ago
virmundi | 11 years ago
If you're only working one app on less than 1 TB, maybe you don't need something as complex as Hadoop. But given that a cluster is easy to set up (I built a really simple NameNode plus two DataNodes in 45 minutes, going in cold), it might not be a bad idea.
I'll take this further and say that some tools for Hadoop that are not from Apache are really nice to work with even for non-Hadoop work. For example, I've got to join several 1 GB files together to go from a relational, CSV model into a document-store model. Can I do this with command line tools? Maybe. Cascading makes this really easy. Each file family is a tap. I get tuple joins naturally. I wrote an ArangoDB tap to auto-load into ArangoDB. It was fun, testable and easy. All of this runs sans Hadoop on my little MBP.
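For what it's worth, the command-line answer to that "maybe" is usually `join` from coreutils, which handles simple equi-joins on sorted keys; the CSV layout here is invented for illustration:

```shell
# users.csv: id,name          orders.csv: id,item
printf '1,alice\n2,bob\n'        > users.csv
printf '1,book\n2,lamp\n1,pen\n' > orders.csv

# join requires both inputs to be sorted on the join field
sort -t, -k1,1 users.csv  > users.sorted
sort -t, -k1,1 orders.csv > orders.sorted
join -t, users.sorted orders.sorted   # one output row per matching pair
```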
Fun fact about the Cascading tool set is that I can take my little app from my desktop and plop it onto a Hadoop cluster with little change (taps from local to hadoop). Will I do that in my present example? No. Can I think of places where that's really useful? Yes, daily 35 fraud models' regression tests executed with each build. That's somewhere around 500 full model executions over limited, but meaningful data. All easily done courtesy of a framework that targets Hadoop.
treve | 11 years ago
ricardobeat | 11 years ago
Results: 4.4 GB [1] processed in 47 seconds. Around 96 MB/s; it can probably be made faster, and Node.js is not the best at munging data...
[1] 3201 files taken from http://github.com/rozim/ChessData
notpeter | 11 years ago
It's about how Joyent took the concept of a UNIX pipeline as a true powertool and built a distributed version atop an object filesystem with some little map/reduce syntactic sugar to replace Hadoop jobs with pipelines.
The Bryan Cantrill talk is definitely worth your time, but you can get an understanding of Manta with their 3m screencast: https://youtu.be/d2KQ2SQLQgg
cheng1 | 11 years ago
GNU parallel should be a more widely adopted choice. Lightweight, fast, low-cost, extensible.
sam_lowry_ | 11 years ago
See https://github.com/cheusov/paexec
pmoriarty | 11 years ago
jacquesm | 11 years ago
https://news.ycombinator.com/item?id=8902739
mabbo | 11 years ago
When we came to the last stage, analysis, he was keen to use MapReduce so we let him. In the end, though, his analysis didn't work well, took ages to process when it did, and didn't provide the answers we needed. The code wasn't maintainable or reusable. *shrug* It happens. I had worse internships.
I put together some command line scripts to parse the files instead- grep, awk, sed, really basic stuff piped into each other and written to other files. They took 10 minutes or so to process, and provided reliable answers. The scripts were added as an appendix to the report I provided on the A/B test, and after formatting and explanations, took up a couple pages.
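A hypothetical sketch of that kind of pipeline (the log format "timestamp bucket event" is invented for illustration):

```shell
# Count conversions per A/B bucket from whitespace-delimited logs.
grep ' conversion$' experiment.log \
  | awk '{ count[$2]++ } END { for (b in count) print b, count[b] }' \
  | sort > conversions_by_bucket.txt
```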
arthurcolle | 11 years ago
Just curious
m_mueller | 11 years ago
knodi123 | 11 years ago
It used to be a fairly slow algorithm, so we ran it in a hadoop cluster and it cached the scores for every user vs. every food in a massive table on a distributed database.
Another developer, who is quite clever, rewrote our algorithm in C, and compiled it as a database function, which was about 100x faster. He also did some algebra work and found a way to change our calculations, yielding a measly 4-5x improvement.
It was so, so, so much faster that in one swoop we eliminated our entire Hadoop cluster and the massive scores table, and were actually able to sort your food search results by score, calculating scores on the fly.
saym | 11 years ago
NyxWulf | 11 years ago
Overall I find it very efficient to use the same toolset locally and then scale it up to a cluster when and if I need to.
azylman | 11 years ago
decisiveness | 11 years ago
taltman1 | 11 years ago
In that spirit, one can optimize the xargs mawk invocation by 1) getting rid of string-manipulation function calls (which are slow in awk), 2) using regular expressions in the pattern expression (which allows awk to short-circuit the evaluation of lines), and 3) avoiding use of field variables like $1 and $2, which allows the mawk virtual machine to avoid implicit field splitting. A bonus is that you end up with an awk script which is more idiomatic:
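Applied to the article's PGN result counting, those three tips might look something like this (a sketch, not the commenter's actual script):

```shell
mawk '
  # Regex-only patterns let mawk short-circuit non-matching lines, and
  # because no field variables ($1, $2, ...) or string functions appear,
  # no implicit field splitting is ever performed.
  /^\[Result "1-0"/ { white++ }
  /^\[Result "0-1"/ { black++ }
  /^\[Result "1\/2/ { draw++  }
  END { print white + black + draw, white, black, draw }
' *.pgn
```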
Notice that I got rid of printing the intermediate per-file totals. Since we are only tabulating the final total, we can modify the 'reduce' mawk invocation accordingly: making the bottleneck data stream thinner always helps with overall throughput.
philgoetz | 11 years ago
Second, you don't get to pretend you invented shell scripting because you came up with a new name for it.
Third, there are very few cases if any where writing a shell script is better than writing a Perl script.
MrBuddyCasino | 11 years ago
"Here's a concrete example: suppose you have millions of web pages that you want to download and save to disk for later processing. How do you do it? The cool-kids answer is to write a distributed crawler in Clojure and run it on EC2, handing out jobs with a message queue like SQS or ZeroMQ.
The Taco Bell answer? xargs and wget. In the rare case that you saturate the network connection, add some split and rsync. A "distributed crawler" is really only like 10 lines of shell script."
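The "ten lines" being described are roughly the following (urls.txt, the worker host name, and the flag values are all placeholders):

```shell
# Fetch every URL in urls.txt, 16 downloads at a time.
xargs -n 1 -P 16 wget --tries=3 --timeout=30 -q < urls.txt

# If one machine's network link saturates, shard the list and
# ship the pieces to other hosts to run the same one-liner:
split -l 100000 urls.txt shard_
for f in shard_*; do rsync "$f" "worker:/tmp/$f"; done
```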
[0] since his blog is gone: http://readwrite.com/2011/01/22/data-mining-and-taco-bell-pr...
threeseed | 11 years ago
Here's what the "sensible adults" think about when they see problems like this. Operational supportability: how do you monitor the operation? Restart recovery: do you have the ability to restart the operation midway through if something fails? Maintainability: can we run the same application on our desktop as on our production servers? Extensibility: can we extend the platform easily to do X, Y, Z after the crawling?
I can't stand developers who come up with the xargs/wget approach, hack something together and then walk away from it. I've seen it far too often and it's great for the short term. Dreadful for the long term.
lloeki | 11 years ago
zcat/cut/sed/grep/awk/perl crawled through it in a couple of minutes and required less than half an hour to craft a reliable enough implementation (including looking up relations from foreign keys straight from the SQL dumps).
My colleagues, who still don't get the point of a command line, would still be restoring each dump individually and running SQL requests manually to this day (or more probably declare it "too complex" and mark it as a loss). Side note: I'm torn between leaving this place where nobody seems to understand the point of anything remotely like engineering or keeping this job where I'm obviously being extremely useful to our customers.
101914 | 11 years ago
If so, using wget is a poor solution. I have not used wget in over a decade but as I recall it does not do HTTP pipelining; I could be wrong on that - please correct me.
I do recall with certainty that when wget was first written and disseminated in the 1990's, "webmasters" wanted to ban it. httpd's were not as resilient then as they are today, nor was bandwidth and hardware as inexpensive.
HTTP pipelining is a smarter alternative than burdening the remote host with thousands of consecutive or simultaneous connections.
Depending on the remote host's httpd settings, HTTP pipelining usually lets you make 100 or maybe more requests over a single connection. It can be accomplished with only a simple TCP client like the original nc and the shell.
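A sketch of the nc-plus-shell version (example.com and the paths are placeholders, and whether the requests are actually served pipelined depends on the server's configuration):

```shell
# Two requests written back to back on one connection; the
# 'Connection: close' header on the last request asks the server
# to hang up once everything has been answered.
{
  printf 'GET /page1 HTTP/1.1\r\nHost: example.com\r\n\r\n'
  printf 'GET /page2 HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n'
} | nc example.com 80 > responses.http
```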
In any event, the line about a "distributed crawler" is spot on. Never underestimate the power of marketing to suspend common sense.
Also, I find that I can often speed my scripts up a little by using exec in shell pipelines, e.g., util1 | exec util2 or exec util1 | exec util2.
There are other, better approaches besides using the builtin exec, but I will leave those for another day.
mercurial | 11 years ago
1: http://doc.scrapy.org/en/0.24/intro/tutorial.html
profquail | 11 years ago
juliangregorian | 11 years ago
make3 | 11 years ago
kylek | 11 years ago
ggreer | 11 years ago
I built ag for searching code. It can be (ab)used for other stuff, but the defaults are optimized for a developer searching a codebase. Also, when writing ag, I don't go out of my way to make sure behavior is correct on all platforms in all corner cases. Grep, on the other hand, has been around for decades. It probably handles cases I've never even thought of.
wglb | 11 years ago
edit: with HN commentary: https://news.ycombinator.com/item?id=615587