research-article

A comparison of approaches to large-scale data analysis

Authors:
Andrew Pavlo

Brown University, Providence, RI, USA

Brown University, Providence, RI, USA
View Profile

,
Erik Paulson

University of Wisconsin, Madison, WI, USA

University of Wisconsin, Madison, WI, USA
View Profile

,
Alexander Rasin

Brown University, Providence, RI, USA

Brown University, Providence, RI, USA
View Profile

,
Daniel J. Abadi

Yale University, New Haven, CT, USA

Yale University, New Haven, CT, USA
View Profile

,
David J. DeWitt

Microsoft Inc., Madison, WI, USA

Microsoft Inc., Madison, WI, USA
View Profile

,
Samuel Madden

Massachusetts Institute of Technology, Cambridge, MA, USA

Massachusetts Institute of Technology, Cambridge, MA, USA
View Profile

,
Michael Stonebraker

Massachusetts Institute of Technology, Cambridge, MA, USA

Massachusetts Institute of Technology, Cambridge, MA, USA
View Profile

SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of dataJune 2009Pages 165–178https://doi.org/10.1145/1559845.1559865

Published:29 June 2009Publication History

SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data

Pages 165–178

ABSTRACT

There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis [17]. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model [8, 17]. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a benchmark consisting of a collection of tasks that we have run on an open source version of MR as well as on two parallel DBMSs. For each task, we measure each system's performance for various degrees of parallelism on a cluster of 100 nodes. Our results reveal some interesting trade-offs. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better. We speculate about the causes of the dramatic performance difference and consider implementation concepts that future systems should take from both kinds of architectures.

References

Hadoop. http://hadoop.apache.org/.Google Scholar
Hive. http://hadoop.apache.org/hive/.Google Scholar
Vertica. http://www.vertica.com/.Google Scholar
Y. Amir and J. Stanton. The Spread Wide Area Group Communication System. Technical report, 1998.Google Scholar
R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. Scope: easy and efficient parallel processing of massive data sets. Proc. VLDB Endow., 1(2):1265--1276, 2008. Google ScholarDigital Library
Cisco Systems. Cisco Catalyst 3750-E Series Switches Data Sheet, June 2008.Google Scholar
J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton. MAD Skills: New Analysis Practices for Big Data. Under Submission, March 2009.Google ScholarDigital Library
J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI '04, pages 10--10, 2004. Google ScholarDigital Library
D. J. DeWitt and R. H. Gerber. Multiprocessor Hash-based Join Algorithms. In VLDB '85, pages 151--164, 1985. Google ScholarDigital Library
D. J. DeWitt, R. H. Gerber, G. Graefe, M. L. Heytens, K. B. Kumar, and M. Muralikrishna. GAMMA - A High Performance Dataflow Database Machine. In VLDB '86, pages 228--237, 1986. Google ScholarDigital Library
S. Fushimi, M. Kitsuregawa, and H. Tanaka. An Overview of The System Software of A Parallel Relational Database Machine. In VLDB '86, pages 209--219, 1986. Google ScholarDigital Library
S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System. SIGOPS Oper. Syst. Rev., 37(5):29--43, 2003. Google ScholarDigital Library
M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed Data-parallel Programs from Sequential Building Blocks. In EuroSys '07, pages 59--72, 2007. Google ScholarDigital Library
E. Meijer, B. Beckman, and G. Bierman. LINQ: reconciling object, relations and XML in the .NET framework. In SIGMOD '06, pages 706--706, 2006. Google ScholarDigital Library
C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. In SIGMOD '08, pages 1099--1110, 2008. Google ScholarDigital Library
J. Ong, D. Fogg, and M. Stonebraker. Implementation of data abstraction in the relational database system ingres. SIGMOD Rec., 14(1):1--14, 1983. Google ScholarDigital Library
D. A. Patterson. Technical Perspective: The Data Center is the Computer. Commun. ACM, 51(1):105--105, 2008. Google ScholarDigital Library
R. Rustin, editor. ACM--SIGMOD Workshop on Data Description, Access and Control, May 1974.Google Scholar
M. Stonebraker. The Case for Shared Nothing. Database Engineering, 9:4--9, 1986.Google Scholar
M. Stonebraker and J. Hellerstein. What Goes Around Comes Around. In Readings in Database Systems, pages 2--41. The MIT Press, 4th edition, 2005.Google Scholar
D. Thomas, D. Hansson, L. Breedt, M. Clark, J. D. Davidson, J. Gehtland, and A. Schwarz. Agile Web Development with Rails. Pragmatic Bookshelf, 2006. Google ScholarDigital Library

Index Terms

A comparison of approaches to large-scale data analysis
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Parallel and distributed DBMSs

Recommendations

Performance characteristics of hybrid MPI/OpenMP implementations of NAS parallel benchmarks SP and BT on large-scale multicore supercomputers
Special issue on the 1st international workshop on performance modeling, benchmarking and simulation of high performance computing systems (PMBS 10)

The NAS Parallel Benchmarks (NPB) are well-known applications with the fixed algorithms for evaluating parallel systems and tools. Multicore supercomputers provide a natural programming paradigm for hybrid programs, whereby OpenMP can be used with the ...
Read More
HadoopDB in action: building real world applications
SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

HadoopDB is a hybrid of MapReduce and DBMS technologies, designed to meet the growing demand of analyzing massive datasets on very large clusters of machines. Our previous work has shown that HadoopDB approaches parallel databases in performance and ...
Read More
MapReduce in MPI for Large-scale graph algorithms

We describe a parallel library written with message-passing (MPI) calls that allows algorithms to be expressed in the MapReduce paradigm. This means the calling program does not need to include explicit parallel code, but instead provides ''map'' and ''...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
June 2009
1168 pages
ISBN:9781605585512
DOI:10.1145/1559845
Editors:
Carsten Binnig,
Benoit Dageville,
General Chairs:
Uğur Çetintemel
Brown University, USA
,
Stan Zdonik
Brown University, USA
,
Program Chair:
Donald Kossmann
ETH Zurich, Switzerland
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 29 June 2009
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Badges
- Results Reproduced / v1.1
Author Tags
benchmarks
mapreduce
parallel database
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate785of4,003submissions,20%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 682
  Total Citations
  View Citations
- 8,909
  Total Downloads
- Downloads (Last 12 months)276
- Downloads (Last 6 weeks)40
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

A comparison of approaches to large-scale data analysis

SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Performance characteristics of hybrid MPI/OpenMP implementations of NAS parallel benchmarks SP and BT on large-scale multicore supercomputers

HadoopDB in action: building real world applications

MapReduce in MPI for Large-scale graph algorithms