DOI: 10.1145/1559845.1559865
research-article
Results Reproduced / v1.1

A comparison of approaches to large-scale data analysis

Published: 29 June 2009

ABSTRACT

There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis [17]. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model [8, 17]. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a benchmark consisting of a collection of tasks that we have run on an open source version of MR as well as on two parallel DBMSs. For each task, we measure each system's performance for various degrees of parallelism on a cluster of 100 nodes. Our results reveal some interesting trade-offs. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better. We speculate about the causes of the dramatic performance difference and consider implementation concepts that future systems should take from both kinds of architectures.

References

  1. Hadoop. http://hadoop.apache.org/.
  2. Hive. http://hadoop.apache.org/hive/.
  3. Vertica. http://www.vertica.com/.
  4. Y. Amir and J. Stanton. The Spread Wide Area Group Communication System. Technical report, 1998.
  5. R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. Scope: easy and efficient parallel processing of massive data sets. Proc. VLDB Endow., 1(2):1265--1276, 2008.
  6. Cisco Systems. Cisco Catalyst 3750-E Series Switches Data Sheet, June 2008.
  7. J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton. MAD Skills: New Analysis Practices for Big Data. Under Submission, March 2009.
  8. J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI '04, pages 10--10, 2004.
  9. D. J. DeWitt and R. H. Gerber. Multiprocessor Hash-based Join Algorithms. In VLDB '85, pages 151--164, 1985.
  10. D. J. DeWitt, R. H. Gerber, G. Graefe, M. L. Heytens, K. B. Kumar, and M. Muralikrishna. GAMMA - A High Performance Dataflow Database Machine. In VLDB '86, pages 228--237, 1986.
  11. S. Fushimi, M. Kitsuregawa, and H. Tanaka. An Overview of The System Software of A Parallel Relational Database Machine. In VLDB '86, pages 209--219, 1986.
  12. S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System. SIGOPS Oper. Syst. Rev., 37(5):29--43, 2003.
  13. M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed Data-parallel Programs from Sequential Building Blocks. In EuroSys '07, pages 59--72, 2007.
  14. E. Meijer, B. Beckman, and G. Bierman. LINQ: reconciling object, relations and XML in the .NET framework. In SIGMOD '06, pages 706--706, 2006.
  15. C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. In SIGMOD '08, pages 1099--1110, 2008.
  16. J. Ong, D. Fogg, and M. Stonebraker. Implementation of data abstraction in the relational database system ingres. SIGMOD Rec., 14(1):1--14, 1983.
  17. D. A. Patterson. Technical Perspective: The Data Center is the Computer. Commun. ACM, 51(1):105--105, 2008.
  18. R. Rustin, editor. ACM--SIGMOD Workshop on Data Description, Access and Control, May 1974.
  19. M. Stonebraker. The Case for Shared Nothing. Database Engineering, 9:4--9, 1986.
  20. M. Stonebraker and J. Hellerstein. What Goes Around Comes Around. In Readings in Database Systems, pages 2--41. The MIT Press, 4th edition, 2005.
  21. D. Thomas, D. Hansson, L. Breedt, M. Clark, J. D. Davidson, J. Gehtland, and A. Schwarz. Agile Web Development with Rails. Pragmatic Bookshelf, 2006.

    • Published in

      SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
      June 2009
      1168 pages
      ISBN: 9781605585512
      DOI: 10.1145/1559845

      Copyright © 2009 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 785 of 4,003 submissions, 20%
