Column-oriented storage techniques for MapReduce

Published: 01 April 2011

Abstract

Users of MapReduce often run into performance problems when they scale up their workloads. Many of the problems they encounter can be overcome by applying techniques learned from over three decades of research on parallel DBMSs. However, translating these techniques to a MapReduce implementation such as Hadoop presents unique challenges that can lead to new design choices. This paper describes how column-oriented storage techniques can be incorporated in Hadoop in a way that preserves its popular programming APIs.

We show that simply using binary storage formats in Hadoop can provide a 3x performance boost over the naive use of text files. We then introduce a column-oriented storage format that is compatible with the replication and scheduling constraints of Hadoop and show that it can speed up MapReduce jobs on real workloads by an order of magnitude. We also show that dealing with complex column types such as arrays, maps, and nested records, which are common in MapReduce jobs, can incur significant CPU overhead. Finally, we introduce a novel skip list column format and lazy record construction strategy that avoids deserializing unwanted records to provide an additional 1.5x performance boost. Experiments on a real intranet crawl are used to show that our column-oriented storage techniques can improve the performance of the map phase in Hadoop by as much as two orders of magnitude.
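The idea of skipping over unwanted values and building records lazily can be illustrated with a small sketch. This is not the paper's on-disk format or its Hadoop API: the length-prefixed JSON encoding, the SKIP_EVERY interval, and the names write_column, ColumnReader, LazyRecord, and demo are all hypothetical, chosen only to keep the example self-contained. Each column is stored separately with a table of skip offsets, and a column value is deserialized only when the map logic actually asks for it.

```python
import json
import struct

SKIP_EVERY = 4  # hypothetical interval: one skip offset per 4 values


def write_column(values):
    """Serialize one column as length-prefixed JSON values.

    Returns (data, skips), where skips[b] is the byte offset of
    value number b * SKIP_EVERY.
    """
    buf = bytearray()
    skips = []
    for i, value in enumerate(values):
        if i % SKIP_EVERY == 0:
            skips.append(len(buf))
        payload = json.dumps(value).encode("utf-8")
        buf += struct.pack(">I", len(payload)) + payload
    return bytes(buf), skips


class ColumnReader:
    """Forward-only reader that can skip values without decoding them."""

    def __init__(self, data, skips):
        self.data = data
        self.skips = skips
        self.pos = 0      # byte offset of the next value
        self.index = 0    # row number of the next value
        self.decoded = 0  # how many values were actually deserialized

    def seek(self, row):
        if row < self.index:
            raise ValueError("reader only moves forward")
        block = row // SKIP_EVERY
        if block * SKIP_EVERY > self.index:
            # Jump over whole blocks using the skip table.
            self.pos = self.skips[block]
            self.index = block * SKIP_EVERY
        while self.index < row:
            # Hop over one value by reading only its length prefix.
            (length,) = struct.unpack_from(">I", self.data, self.pos)
            self.pos += 4 + length
            self.index += 1

    def read(self):
        (length,) = struct.unpack_from(">I", self.data, self.pos)
        start = self.pos + 4
        value = json.loads(self.data[start:start + length])
        self.pos = start + length
        self.index += 1
        self.decoded += 1
        return value


class LazyRecord:
    """Record whose columns are deserialized only on first access."""

    def __init__(self, row, readers):
        self.row = row
        self.readers = readers
        self.cache = {}

    def get(self, name):
        if name not in self.cache:
            reader = self.readers[name]
            reader.seek(self.row)
            self.cache[name] = reader.read()
        return self.cache[name]


def demo():
    urls = [f"http://example.com/page{i}" for i in range(10)]
    meta = [{"size": i, "type": "text/html"} for i in range(10)]
    readers = {
        "url": ColumnReader(*write_column(urls)),
        "metadata": ColumnReader(*write_column(meta)),
    }
    sizes = []
    for row in range(10):
        record = LazyRecord(row, readers)
        # The metadata column is touched only for matching URLs.
        if record.get("url").endswith(("0", "7")):
            sizes.append(record.get("metadata")["size"])
    return sizes, readers["metadata"].decoded
```

In this sketch, the filter inspects every URL but deserializes the more complex metadata column only for the rows that pass, so the cost of decoding maps or nested values is paid only for the records a job actually uses.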

Published in: Proceedings of the VLDB Endowment, Volume 4, Issue 7, April 2011 (61 pages)
Publisher: VLDB Endowment
Article type: research-article