Hadoop reaches 1.0, and my understanding of how to use it is still in development.
Does anyone have a high level resource on how MapReduce works for mediocre programmers like myself who are late to the game? I know she's not ready to have my babies, but surely I could get to know her a little, maybe just be friends? I grabbed a Hadoop pre-made virtual machine the other month and was so far over my head that I had to run away to regroup.
In general I have some very unoptimized problems that MapReduce probably isn't the right shoe for, but I'd love to explain to my boss why it's the wrong shoe. And learning about it might be a great start down that path.
A good introduction to MapReduce is probably CouchDB, where you use it for database views instead of SQL-style queries. The basic concepts are:
- The "Map" phase takes a key/value pair of input and produces as many other key/value pairs of output as it wants. This can be zero, it can be one, or it can be over 9000. Each Map over a piece of input data operates in isolation.
- The "Reduce" phase takes a bunch of values with the same (or similar, depending on how it's invoked) keys and reduces them down into one value.
A good example is, say you have a bunch of documents like this:
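The sample documents were lost from the text; a plausible shape, where the `type` and `tags` field names are assumptions chosen to match the map function below, might be:

```javascript
// Hypothetical blog-post documents; field names are assumptions.
const docs = [
  { type: "post", title: "Intro to CouchDB", tags: ["databases", "couchdb"] },
  { type: "post", title: "Why MapReduce?", tags: ["databases", "mapreduce"] },
  { type: "post", title: "Indexing 101", tags: ["databases"] },
  { type: "comment", text: "Nice article!" } // not a post; the map skips it
];
```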
And you want to find out all the tags, and how many posts have a given tag. First, you have a map phase:
function (doc) {
  if (doc.type === "post") {
    doc.tags.forEach(function (tag) {
      emit(tag, 1);
    });
  }
}
In this case, it filters out all the documents that aren't posts. It then emits a `(tag, 1)` pair for each tag on the post. You may end up with a pair set that looks like:
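Since the original pair listing was lost, here's a minimal sketch that runs the map function above over hypothetical documents (field names assumed) and collects the emitted pairs:

```javascript
// Collect emitted pairs the way CouchDB would hand them to reduce.
const pairs = [];
function emit(key, value) { pairs.push([key, value]); }

// Hypothetical documents; field names are assumptions.
const docs = [
  { type: "post", tags: ["databases", "couchdb"] },
  { type: "post", tags: ["databases", "mapreduce"] },
  { type: "post", tags: ["databases"] },
  { type: "comment" } // filtered out by the type check
];

docs.forEach(function (doc) {
  if (doc.type === "post") {
    doc.tags.forEach(function (tag) {
      emit(tag, 1);
    });
  }
});

console.log(pairs);
// [["databases",1],["couchdb",1],["databases",1],["mapreduce",1],["databases",1]]
```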
Then your reduce phase sums the values for the pairs it's handed. The kind of result you get depends on how you invoke it: if you reduce the whole dataset at once, you get a single number - the sum of the values from all the pairs - while running it in group mode reduces each key separately. Since the sum of all the pairs keyed "databases" was 3, the value for "databases" comes out as 3. You're not limited to summing - any kind of operation that aggregates multiple values and can be grouped by key will work as well.
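The reduce function itself didn't survive in the text; here's a sketch of what it might look like in CouchDB's `(keys, values, rereduce)` style, where the driver code around it is a simplified assumption, not CouchDB itself:

```javascript
// A CouchDB-style reduce function: sums the values it's handed.
function reduceFn(keys, values, rereduce) {
  return values.reduce(function (a, b) { return a + b; }, 0);
}

// The pair set emitted by the map phase (hypothetical data).
const pairs = [
  ["databases", 1], ["couchdb", 1], ["databases", 1],
  ["mapreduce", 1], ["databases", 1]
];

// Reducing the whole dataset at once yields one grand total:
const total = reduceFn(pairs.map(p => p[0]), pairs.map(p => p[1]), false);
// total is 5, the sum of the values from all the pairs.

// Group mode reduces each key separately:
const byKey = {};
pairs.forEach(function (pair) {
  (byKey[pair[0]] = byKey[pair[0]] || []).push(pair[1]);
});
const grouped = {};
Object.keys(byKey).forEach(function (key) {
  grouped[key] = reduceFn([key], byKey[key], false);
});
// grouped is { databases: 3, couchdb: 1, mapreduce: 1 }
```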
Like you said, there are problems that this doesn't work for. But for the problems it does work for, it's very computationally efficient and fun.
Hadoop is much more than MapReduce. It's a fault-tolerant, highly scalable file system where you can store a shitload (technical term) of data. That alone is remarkable.
Once your data is there, then you get your map/reduce on. And the best way to get started there is to look into Pig or Hive (high level map/reduce abstractions). Either of those will take you a long way.
Don't write raw MapReduce. Check out my recent tutorials on http://datasyndrome.com and check out Apache Pig. It's easy to understand, and it will generate MapReduce for you. Try Amazon Elastic MapReduce with Pig on some logs.
I found "Hadoop: The Definitive Guide" to be excellent (hah, I just noticed there's a quote from me on Twitter on the book's homepage): http://www.hadoopbook.com/
The Hadoop and HBase books from O'Reilly Media are quite good. They gave me a good overview and got me up to speed enough that I was comfortable stepping into a new personal project using them.
Awesome! Now if we can just get HBase to update its prereqs and bump its version, I can have some symmetry in my life!
On a more serious note - is anyone using HDFS for the kind of thing WebHDFS was designed for? We're currently looking at HDFS for an event-store mechanism, but it appears to me to be pretty large-file / stream oriented, and I'm wondering how it will stack up if we want to do something that involves files much smaller than, say, 64MB.
If you settle on Java (or a bit of Java extensions), you can probably write your own TaskSplitter and define the way Hadoop should distribute your jobs into smaller tasks. Be aware: you might end up either having a lot of trouble getting the 'optimal splits', or you'll lose one of Hadoop's major advantages, data (calculation) locality. For example, if you decide to combine 10 smaller files into a single task and you have 10 different DataNodes, chances are small that all the files are stored on the machine that's performing the MapReduce task.
One thing to note, though: HDFS is indeed very stream oriented. It works in blocks of 64 MB (by default), and only sends data upstream when you either close a file or a full block is available to be written. So, if your server crashes at 63 MB and the data is unrecoverable, you'll have lost all 63 MB of it. That was one of the big caveats we had to work around for the problems we solve with Hadoop.
You can try out MapR's Hadoop distribution, which uses its own filesystem rather than HDFS. The MapR filesystem is not append-only, handles small files well, is easily accessible over NFS, etc.
(*Note: I'm a MapR employee, so obviously I'll be biased towards thinking our stuff is great)
I agree with an earlier comment. Big Data, Hadoop, etc. are keywords that are supposed to get big in 2012. However, as a regular web dev, it's hard for me to grasp what it can do unless you have gigantic data stores.
Congrats on the milestone to those involved - it's great to have something like this available to everyone for free.
On a side note, and not to take anything away from the H-team, I'm pretty curious on how it compares to Google's GFS and the rest of their distributed computing stack (MR, Chubby, etc.). It would be sweet if Google released some or all of these some day.
The 1.0.0 release is actually formerly known as the 0.20.205.1 release - i.e., just bugfixes since 0.20.205.
Hadoop's been "production ready" for years - there are hundreds of companies running it in business critical applications. But some people want to see "1.0" before they move to production :) So we recently decided to call it 1.0 so that the version numbering matches the maturity Hadoop has already achieved.
-Todd (Hadoop PMC)
There's also a nice, simple little Hadoop setup guide: http://hadoop.apache.org/common/docs/current/single_node_set...
* 0.23.0: 11 November, 2011
* 0.22.0: 10 December, 2011
Now we have 1.0, but it's based on 0.20, not any of the more recent releases?
The 1.0 release notes are pretty useless--it's just a list of issues. Is there a summary anywhere?
Working with Hadoop a few years ago was a pain in the ass; what really made it ready (at least for me) was the packaging done by Cloudera.