MapReduce: A step backwards, perhaps, but maybe a couple of steps forward, too
Eugene Shekita, IBM Almaden

Introduction
------------
In January 2008, David DeWitt and Michael Stonebraker wrote a provocative
blog titled "MapReduce: A major step backwards" [1][2]. The blog slammed
MapReduce on several fronts, arguing that for large scale data analysis,
the MapReduce framework (and in particular the open-source implementation
of it in Hadoop [3]), is not novel and inferior to commercial shared-nothing
RDBMS engines like Teradata, DB2, and Vertica. Moreover, the authors accused
the Hadoop community of engaging in hype and ignoring 25 years of parallel
DBMS research.

A small flame war erupted on the blog, with a lot of the Hadoop supporters
(the "young turks") basically arguing that 'these old database guys (the
"gray beards") just don't get it.' Slightly more than a year has past since
DeWitt and Stonebraker wrote their blog. It's interesting to revisit some
of the criticism leveled at MapReduce (and Hadoop in particular) to see 
whether those young turks are learning from their elders and whether they
may be onto something after all.

Data Models, Schemas, and Query Languages
-----------------------------------------
One criticism leveled at MapReduce was that it's too low level -- that it
lacks a real data model, a schema, and a catalog that can be queried;
worse, it lacks a high level query language. The database community learned
the importance of all these things about 30 years ago, of course.

When they wrote their blog, DeWitt and Stonebraker did mention that the Pig
project [4] at Yahoo! Research, which adds a semi-structured data model and
high level query language to Hadoop, was a step in the right direction.  
Since then Pig has become much more complete, other query languages for 
Hadoop have also emerged, including Facebook's Hive [5], IBM's Jaql [6] ,
Cascading [7], and Cloudbase [8].

It's clear from all this activity that Hadoop will eventually have a real
data model, schema, catalogs, and query language -- probably several of
them. The young turks are learning. What's also interesting is that many of
the emerging data models and query languages have a distinctly different
flavor than SQL. For example, Pig, Jaql, and Cascading all support
semi-structured data and are really more like high-level, parallel data flow
languages than declarative query languages.

Are the young turks onto something here? Many analysis problems require 
semi-structured data and are more naturally expressed as a sequence of
dataflow steps [15][16]. Languages like Pig, Jaql, and Cascading enable 
these dataflows to be described at a high level and then compiled into
parallel MapReduce jobs. Of course, some algorithms like machine learning
cannot be expressed in a high-level dataflow language. But in Hadoop, it is
always possible to drop down to the raw MapReduce programming level to
implement those algorithms [9]. In contrast, these kind of dataflows would
be difficult, if not impossible, to express in SQL.

It's worth noting that both Greenplum and Aster Data recently added support
for MapReduce functions in their shared-nothing RDBMS engines [10][11].

While this is an interesting development, it looks more limited than what
Pig, Jaql, and Cascading can provide in Hadoop. That's because calls to 
MapReduce functions have to be wrapped in SQL queries, making it difficult
to work with semi-structured data and program multi-step dataflows.

Implementation and Performance
------------------------------
Another criticism leveled at Hadoop was that it's a "poor implementation"
of a parallel query engine, with no indexing, hash-based methods, skew
handling, and so on. Moreover, the "pull' model to start jobs and exchange
data between nodes is strictly inferior to the "push" model used by most
shared-nothing RDBMS engines.

In fact, there is nothing to prevent indexing, hash-based methods, and skew
handling from being used in Hadoop. The codebase is still immature and
these techniques have not been a priority. On the problems where Hadoop is
most commonly used today, simple scans and sort-based methods are robust
and usually good enough. The duality between sort- and hash-based methods
is well known [17].

As far as the push vs. pull discussion goes, until somebody publishes a 
detailed performance study comparing the two, it is hard to say how much
performance degradation a pull model causes. DeWitt and Stonebraker did 
seem to concede that a pull model is probably a better "building block" for
fault tolerance, though. For example, in MapReduce, by temping the results
of mappers, and then having reducers pull those results across the network,
mappers can be restarted if they are slow or if their nodes fail.

Despite being an immature codebase and despite all the above shortcomings,
Hadoop was still able to break the TB sort record recently [12]. Google 
subsequently trumped Yahoo with their proprietary implementation of
MapReduce and went one step further by doing a PB sort [13] on 4,000
servers. Interestingly, no shared-nothing RDBMS seems to have ever held the
TB sort record... Perhaps the gray beards have something to learn about 
scalability from the young turks.

Novelty
-------
Yet another criticism leveled at MapReduce is that it's not novel, that 
Teradata was doing parallel groupby 20 years ago, and that UDAs and UDFs
appeared in Postgress in the mid 80s. What's new, however, is that
MapReduce is much more flexible, allowing semi-stuctured data types to be
handled more easily. More control is also provided over how data is
partitioned and how each partition is ordered and grouped. This control is
critical for certain kinds of data analysis. Finally, MapReduce also takes
care of fault tolerance, which is also critical for large scale data
analysis on commodity hardware. Aster Data's support of the MapReduce
framework, despite the fact that their engine was built on Postgres,
suggests the framework really does provide something novel.

[1] http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html
[2] http://www.databasecolumn.com/2008/01/mapreduce-continued.html
[3] http://hadoop.apache.org/
[4] http://wiki.apache.org/pig
[5] http://hadoop.apache.org/hive/
[6] http://code.google.com/p/jaql/
[7] http://www.cascading.org/
[8] http://cloudbase.sourceforge.net
[9] http://lucene.apache.org/mahout/
[10] http://www.greenplum.com/resources/mapreduce/
[11]
http://www.asterdata.com/blog/index.php/2008/08/25/announcing-in-database-mapreduce/
[12]
http://developer.yahoo.net/blogs/hadoop/2008/07/apache_hadoop_wins_terabyte_sort_benchmark.html
[13] http://googleblog.blogspot.com/2008/11/sorting-1pb-with-mapreduce.html
[14] http://developer.postgresql.org/pgdocs/postgres/xaggr.html
[15] C. Olsten, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, "Pig
Latin: A Not-So-Foreign Language for Data Processing," SIGMOD 2008.

[16] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan, "Interpreting the
Data: Parallel Analysis with Sawzall," Scientific Programming Journal 13
(4), 2005.

[17] G. Graefe, A Linville, and Len Shapiro, "Sort vs. Hash Revisited," 
IEEE Trans. on Knowledge and Data Engineering, 1994.