Abstract
Users of MapReduce often run into performance problems when they scale up their workloads. Many of the problems they encounter can be overcome by applying techniques learned from over three decades of research on parallel DBMSs. However, translating these techniques to a MapReduce implementation such as Hadoop presents unique challenges that can lead to new design choices. This paper describes how column-oriented storage techniques can be incorporated into Hadoop in a way that preserves its popular programming APIs.
We show that simply using binary storage formats in Hadoop can provide a 3x performance boost over the naive use of text files. We then introduce a column-oriented storage format that is compatible with the replication and scheduling constraints of Hadoop and show that it can speed up MapReduce jobs on real workloads by an order of magnitude. We also show that dealing with complex column types such as arrays, maps, and nested records, which are common in MapReduce jobs, can incur significant CPU overhead. Finally, we introduce a novel skip-list column format and a lazy record construction strategy that avoid deserializing unwanted records, providing an additional 1.5x performance boost. Experiments on a real intranet crawl show that our column-oriented storage techniques can improve the performance of the map phase in Hadoop by as much as two orders of magnitude.
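The intuition behind lazy record construction can be sketched as follows. This is an illustrative sketch only, not the paper's implementation: the `LazyRecord` class, its JSON-encoded columns, and the `url`/`metadata` column names are all hypothetical, chosen to show how a record can keep each column serialized until a map function actually reads it.

```python
import json

class LazyRecord:
    """Illustrative sketch of lazy record construction: column bytes stay
    serialized, and a field is decoded only on first access, so a map
    function that touches one column pays no deserialization cost for
    the others."""

    def __init__(self, raw_columns):
        # raw_columns: dict mapping column name -> serialized bytes
        self._raw = raw_columns
        self._decoded = {}  # memoized decoded values

    def get(self, column):
        if column not in self._decoded:
            # Deserialize on demand (hypothetical JSON encoding for clarity).
            self._decoded[column] = json.loads(self._raw[column])
        return self._decoded[column]

# A map function filtering on 'url' touches only that column; the
# (potentially large, nested) 'metadata' column is never deserialized.
rec = LazyRecord({
    "url": b'"http://example.com"',
    "metadata": b'{"tags": ["a", "b"], "links": []}',
})
print(rec.get("url"))          # decodes only the 'url' column
print("metadata" in rec._decoded)  # the other column stays raw
```

The same on-demand idea extends to skipping whole records: if a filter column rules a record out, the remaining columns for that record need never be read or decoded at all.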
Index Terms
- Column-oriented storage techniques for MapReduce