MapReduce: A step backwards, perhaps, but maybe a couple of steps forward, too Eugene Shekita, IBM Almaden Introduction ------------ In January 2008, David DeWitt and Michael Stonebraker wrote a provocative blog titled "MapReduce: A major step backwards" [1][2]. The blog slammed MapReduce on several fronts, arguing that for large scale data analysis, the MapReduce framework (and in particular the open-source implementation of it in Hadoop [3]), is not novel and inferior to commercial shared-nothing RDBMS engines like Teradata, DB2, and Vertica. Moreover, the authors accused the Hadoop community of engaging in hype and ignoring 25 years of parallel DBMS research. A small flame war erupted on the blog, with a lot of the Hadoop supporters (the "young turks") basically arguing that 'these old database guys (the "gray beards") just don't get it.' Slightly more than a year has past since DeWitt and Stonebraker wrote their blog. It's interesting to revisit some of the criticism leveled at MapReduce (and Hadoop in particular) to see whether those young turks are learning from their elders and whether they may be onto something after all. Data Models, Schemas, and Query Languages ----------------------------------------- One criticism leveled at MapReduce was that it's too low level -- that it lacks a real data model, a schema, and a catalog that can be queried; worse, it lacks a high level query language. The database community learned the importance of all these things about 30 years ago, of course. When they wrote their blog, DeWitt and Stonebraker did mention that the Pig project [4] at Yahoo! Research, which adds a semi-structured data model and high level query language to Hadoop, was a step in the right direction. Since then Pig has become much more complete, other query languages for Hadoop have also emerged, including Facebook's Hive [5], IBM's Jaql [6] , Cascading [7], and Cloudbase [8]. It's clear from all this activity that Hadoop will eventually have a real data model, schema, catalogs, and query language -- probably several of them. The young turks are learning. What's also interesting is that many of the emerging data models and query languages have a distinctly different flavor than SQL. For example, Pig, Jaql, and Cascading all support semi-structured data and are really more like high-level, parallel data flow languages than declarative query languages. Are the young turks onto something here? Many analysis problems require semi-structured data and are more naturally expressed as a sequence of dataflow steps [15][16]. Languages like Pig, Jaql, and Cascading enable these dataflows to be described at a high level and then compiled into parallel MapReduce jobs. Of course, some algorithms like machine learning cannot be expressed in a high-level dataflow language. But in Hadoop, it is always possible to drop down to the raw MapReduce programming level to implement those algorithms [9]. In contrast, these kind of dataflows would be difficult, if not impossible, to express in SQL. It's worth noting that both Greenplum and Aster Data recently added support for MapReduce functions in their shared-nothing RDBMS engines [10][11]. While this is an interesting development, it looks more limited than what Pig, Jaql, and Cascading can provide in Hadoop. That's because calls to MapReduce functions have to be wrapped in SQL queries, making it difficult to work with semi-structured data and program multi-step dataflows. Implementation and Performance ------------------------------ Another criticism leveled at Hadoop was that it's a "poor implementation" of a parallel query engine, with no indexing, hash-based methods, skew handling, and so on. Moreover, the "pull' model to start jobs and exchange data between nodes is strictly inferior to the "push" model used by most shared-nothing RDBMS engines. In fact, there is nothing to prevent indexing, hash-based methods, and skew handling from being used in Hadoop. The codebase is still immature and these techniques have not been a priority. On the problems where Hadoop is most commonly used today, simple scans and sort-based methods are robust and usually good enough. The duality between sort- and hash-based methods is well known [17]. As far as the push vs. pull discussion goes, until somebody publishes a detailed performance study comparing the two, it is hard to say how much performance degradation a pull model causes. DeWitt and Stonebraker did seem to concede that a pull model is probably a better "building block" for fault tolerance, though. For example, in MapReduce, by temping the results of mappers, and then having reducers pull those results across the network, mappers can be restarted if they are slow or if their nodes fail. Despite being an immature codebase and despite all the above shortcomings, Hadoop was still able to break the TB sort record recently [12]. Google subsequently trumped Yahoo with their proprietary implementation of MapReduce and went one step further by doing a PB sort [13] on 4,000 servers. Interestingly, no shared-nothing RDBMS seems to have ever held the TB sort record... Perhaps the gray beards have something to learn about scalability from the young turks. Novelty ------- Yet another criticism leveled at MapReduce is that it's not novel, that Teradata was doing parallel groupby 20 years ago, and that UDAs and UDFs appeared in Postgress in the mid 80s. What's new, however, is that MapReduce is much more flexible, allowing semi-stuctured data types to be handled more easily. More control is also provided over how data is partitioned and how each partition is ordered and grouped. This control is critical for certain kinds of data analysis. Finally, MapReduce also takes care of fault tolerance, which is also critical for large scale data analysis on commodity hardware. Aster Data's support of the MapReduce framework, despite the fact that their engine was built on Postgres, suggests the framework really does provide something novel. [1] http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html [2] http://www.databasecolumn.com/2008/01/mapreduce-continued.html [3] http://hadoop.apache.org/ [4] http://wiki.apache.org/pig [5] http://hadoop.apache.org/hive/ [6] http://code.google.com/p/jaql/ [7] http://www.cascading.org/ [8] http://cloudbase.sourceforge.net [9] http://lucene.apache.org/mahout/ [10] http://www.greenplum.com/resources/mapreduce/ [11] http://www.asterdata.com/blog/index.php/2008/08/25/announcing-in-database-mapreduce/ [12] http://developer.yahoo.net/blogs/hadoop/2008/07/apache_hadoop_wins_terabyte_sort_benchmark.html [13] http://googleblog.blogspot.com/2008/11/sorting-1pb-with-mapreduce.html [14] http://developer.postgresql.org/pgdocs/postgres/xaggr.html [15] C. Olsten, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, "Pig Latin: A Not-So-Foreign Language for Data Processing," SIGMOD 2008. [16] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan, "Interpreting the Data: Parallel Analysis with Sawzall," Scientific Programming Journal 13 (4), 2005. [17] G. Graefe, A Linville, and Len Shapiro, "Sort vs. Hash Revisited," IEEE Trans. on Knowledge and Data Engineering, 1994.