Abstract
Users of MapReduce often run into performance problems when they scale up their workloads. Many of the problems they encounter can be overcome by applying techniques learned from over three decades of research on parallel DBMSs. However, translating these techniques to a MapReduce implementation such as Hadoop presents unique challenges that can lead to new design choices. This paper describes how column-oriented storage techniques can be incorporated into Hadoop in a way that preserves its popular programming APIs.
We show that simply using binary storage formats in Hadoop can provide a 3x performance boost over the naive use of text files. We then introduce a column-oriented storage format that is compatible with the replication and scheduling constraints of Hadoop and show that it can speed up MapReduce jobs on real workloads by an order of magnitude. We also show that dealing with complex column types such as arrays, maps, and nested records, which are common in MapReduce jobs, can incur significant CPU overhead. Finally, we introduce a novel skip-list column format and a lazy record construction strategy that avoid deserializing unwanted records, providing an additional 1.5x performance boost. Experiments on a real intranet crawl show that our column-oriented storage techniques can improve the performance of the map phase in Hadoop by as much as two orders of magnitude.
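The intuition behind lazy record construction can be sketched as follows. This is an illustrative sketch only, not the paper's implementation: the `LazyRecord` class, its JSON-encoded columns, and the `url`/`metadata` column names are all hypothetical, chosen to show how a record can keep each column serialized until a map function actually reads it.

```python
import json

class LazyRecord:
    """Illustrative sketch of lazy record construction: column bytes stay
    serialized, and a field is decoded only on first access, so a map
    function that touches one column pays no deserialization cost for
    the others."""

    def __init__(self, raw_columns):
        # raw_columns: dict mapping column name -> serialized bytes
        self._raw = raw_columns
        self._decoded = {}  # memoized decoded values

    def get(self, column):
        if column not in self._decoded:
            # Deserialize on demand (hypothetical JSON encoding for clarity).
            self._decoded[column] = json.loads(self._raw[column])
        return self._decoded[column]

# A map function filtering on 'url' touches only that column; the
# (potentially large, nested) 'metadata' column is never deserialized.
rec = LazyRecord({
    "url": b'"http://example.com"',
    "metadata": b'{"tags": ["a", "b"], "links": []}',
})
print(rec.get("url"))          # decodes only the 'url' column
print("metadata" in rec._decoded)  # the other column stays raw
```

The same on-demand idea extends to skipping whole records: if a filter column rules a record out, the remaining columns for that record need never be read or decoded at all.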
Index Terms
- Column-oriented storage techniques for MapReduce