DOI: 10.1145/1859184.1859192
research-article

Parallel bulk insertion for large-scale analytics applications

Published: 28 July 2010

ABSTRACT

Modern data analytics applications, such as Internet-scale indexing, system trace analysis, and recommender engines, operate on massive amounts of data and call for a parallel approach to data processing. In this work, we focus on the popular MapReduce framework for such tasks and identify bulk data insertion as a critical preliminary step in reducing processing times, especially when new data is generated and processed at regular intervals.

We present a parallel approach to bulk data insertion in a system that uses horizontally range-partitioned data, and we evaluate several variants of the insertion operation, including legacy approaches. Our method exploits the parallel processing framework itself to insert data, which is stored in a semi-structured format, into the system. Our results indicate that a parallel approach to bulk insertion can substantially reduce the recurrent cost of inserting new data into the system.
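As a rough illustration of the general idea, and not the authors' implementation, the sketch below routes records to range partitions in a map-like step and then builds one sorted run per partition in parallel. The split points, function names, and thread pool are all assumptions made for this example; a real deployment would run these steps as MapReduce tasks and write to the storage system rather than return in-memory lists.

```python
from bisect import bisect_right
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical split points: partition i owns keys in [SPLITS[i-1], SPLITS[i]).
SPLITS = ["g", "n", "t"]  # four ranges: < "g", ["g","n"), ["n","t"), >= "t"


def partition_of(key, splits=SPLITS):
    """Return the index of the range partition that owns `key`."""
    return bisect_right(splits, key)


def map_phase(records, splits=SPLITS):
    """Group (key, value) records by owning partition, like a map/shuffle step."""
    buckets = defaultdict(list)
    for key, value in records:
        buckets[partition_of(key, splits)].append((key, value))
    return buckets


def reduce_phase(bucket):
    """Sort one partition's records so they form a single ordered run."""
    return sorted(bucket)


def bulk_insert(records, splits=SPLITS, workers=4):
    """Build one sorted run per non-empty partition, handling partitions in parallel."""
    buckets = map_phase(records, splits)
    parts = sorted(buckets)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        runs = pool.map(reduce_phase, (buckets[p] for p in parts))
        return dict(zip(parts, runs))
```

Because each partition receives only keys from its own range, the per-partition runs can be prepared independently, which is what makes the insertion step parallelizable in this sketch.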


Published in

LADIS '10: Proceedings of the 4th International Workshop on Large Scale Distributed Systems and Middleware
July 2010
65 pages
ISBN: 9781450304061
DOI: 10.1145/1859184

Copyright © 2010 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States
