I am not sure I see the point in implementing MapReduce when most of the current work seems to be in generalising it, e.g. Apache Spark and YARN. Is there any reasoning behind this?
Not the author of the project, but I can think of two reasons.
Firstly, you can think of map/reduce as the infrastructure for higher-level operations: a sort of assembly language of large-scale data processing that higher-level systems compile down to. A breakthrough in the quality of the underlying engine significantly improves the experience of everything built on top of it, so if someone finds a better way to run map/reduce jobs, it's a win for everyone. Having to ship jars instead of Docker containers, and having no snapshots, are serious drawbacks of the existing map/reduce infrastructure that hurt users in significant ways.
Secondly, specifying map/reduce jobs through a simple web server that exposes API endpoints for grouping, mapping, and reducing data is a dramatically simpler, more composable interface. Building higher-level infrastructure on top of that abstraction is an order of magnitude easier than building it on top of Hadoop, so it could be a better underlying platform for the generalisation work being done in the community.
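For concreteness, a job against such an HTTP interface might look something like the sketch below. The endpoint paths and field names here are purely hypothetical (the thread doesn't specify the actual API); the point is that a map/reduce job collapses into a small JSON document plus a couple of HTTP calls, rather than a jar and a cluster config. The in-process simulation shows what the engine would conceptually do with that spec.

```python
import json
from collections import defaultdict

# Hypothetical job spec -- the field names are invented for illustration;
# the real API may look nothing like this.
job = {
    "input": "/pfs/logs",        # dataset to group and map over
    "map": "wordcount-mapper",   # container/executable to run per chunk
    "reduce": "sum",             # how to combine mapper outputs
}
payload = json.dumps(job)  # this is the entire "deployment artifact"

# What the engine does conceptually with that spec, simulated in-process:
def run_job(records):
    # map: each record emits (word, 1) pairs
    pairs = [(word, 1) for rec in records for word in rec.split()]
    # group: collect values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # reduce: fold each group's values down to one result
    return {key: sum(values) for key, values in groups.items()}

counts = run_job(["a b a", "b c"])  # {'a': 2, 'b': 2, 'c': 1}
```

Compare that to packaging a jar, managing a classpath, and submitting through Hadoop's job client: the surface area a higher-level system has to target is far smaller.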
This is a very good question. There's a growing sentiment in the Hadoop ecosystem that MapReduce is in some way passé, and I think that's somewhat unfair. A lot of the confusion comes from the fact that people don't distinguish between Hadoop's implementation of MapReduce and MapReduce the paradigm. As a paradigm, MapReduce is actually very general. A good example of this is stream processing: the Hadoop ecosystem has a completely separate implementation for stream processing in Storm, but there's no inherent reason MapReduce can't operate on streams. In fact, in pfs, where the file system can be thought of as a stream of commits, that's the only thing it operates on.
So tl;dr we think that a better implementation of MapReduce can be a much more general tool than Hadoop's MapReduce is.
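To make the streams point concrete: because the reduce step is just an associative fold over grouped values, the same map/reduce job can run incrementally as commits arrive instead of over a static file. This is a toy sketch (the commit format is invented, not pfs's actual representation), but it shows why nothing in the paradigm itself requires batch input:

```python
from collections import Counter

def map_commit(commit):
    # map step: one commit's text -> partial (word, count) pairs
    return Counter(commit.split())

def merge(state, delta):
    # reduce step: merging partial counts is associative, so it can be
    # applied as each commit lands rather than once over a complete file
    state.update(delta)
    return state

state = Counter()
for commit in ["fix bug", "fix typo", "add feature"]:  # a stream of commits
    state = merge(state, map_commit(commit))
# state now holds {'fix': 2, 'bug': 1, 'typo': 1, 'add': 1, 'feature': 1}
```

The batch case is just the special case where the stream happens to end.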
coffeemug|11 years ago
jdoliner|11 years ago