Parallel bulk insertion for large-scale analytics applications

Authors:
Antonio Barbuzzi (Politecnico di Bari), Pietro Michiardi (Eurecom), Ernst Biersack (Eurecom), Gennaro Boggia (Politecnico di Bari)

LADIS '10: Proceedings of the 4th International Workshop on Large Scale Distributed Systems and Middleware, July 2010, Pages 27–31
https://doi.org/10.1145/1859184.1859192

Published: 28 July 2010

ABSTRACT

Modern data analytics applications, such as Internet-scale indexing, system trace analysis, and recommender engines, operate on massive amounts of data and call for a parallel approach to data processing. In this work, we focus on the popular MapReduce framework to carry out such tasks and identify bulk data insertion as a critical preliminary step toward reducing processing times, especially when new data is generated and processed at regular intervals.

We present a parallel approach to bulk data insertion in a system that uses horizontally range-partitioned data, and we evaluate several variants of the insertion operation, including legacy approaches. Our method exploits the parallel processing framework itself to insert the data, which is stored in a semi-structured format, into the system. Our results indicate that a parallel approach to bulk insertion can substantially reduce the recurrent cost of inserting new data into the system.
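As a rough illustration of the general idea of inserting into horizontally range-partitioned data in parallel, the following minimal sketch groups records by key range and loads each range independently. It is not the authors' implementation: a thread pool stands in for MapReduce tasks, in-memory dictionaries stand in for the stored partitions, and the names `partition_by_range`, `insert_partition`, `parallel_bulk_insert`, and `split_points` are hypothetical.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor


def partition_by_range(records, split_points):
    """Group (key, value) records by key range.

    split_points is a sorted list of boundaries (an assumption of this sketch).
    Partition i holds keys k with split_points[i-1] <= k < split_points[i].
    """
    partitions = [[] for _ in range(len(split_points) + 1)]
    for key, value in records:
        partitions[bisect_right(split_points, key)].append((key, value))
    return partitions


def insert_partition(store, index, batch):
    """Hypothetical per-partition bulk insert: sort once, then load the batch."""
    store[index].update(sorted(batch))
    return len(batch)


def parallel_bulk_insert(store, records, split_points, workers=4):
    """Insert each range's batch concurrently; returns records per partition."""
    partitions = partition_by_range(records, split_points)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(
            insert_partition,
            [store] * len(partitions),
            range(len(partitions)),
            partitions,
        )
        return list(counts)


if __name__ == "__main__":
    store = [{}, {}, {}]  # one dict per key range, standing in for stored partitions
    new_data = [("apple", 1), ("kiwi", 2), ("zebra", 3), ("grape", 4), ("nut", 5)]
    print(parallel_bulk_insert(store, new_data, ["g", "n"]))
```

Because each worker writes only to its own key range, the batches do not contend with one another, which is the property that makes range-partitioned data a natural fit for parallel loading.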

Index Terms

Parallel bulk insertion for large-scale analytics applications
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Parallel and distributed DBMSs

Published in
LADIS '10: Proceedings of the 4th International Workshop on Large Scale Distributed Systems and Middleware
July 2010
65 pages
ISBN: 9781450304061
DOI: 10.1145/1859184
General Chairs: Gregory Chockler (IBM Research Haifa, Israel), Ymir Vigfusson (IBM Research Haifa, Israel)
Program Chairs: Marcos K. Aguilera (Microsoft Research, USA), Marc Shapiro (INRIA & LIP6, France)
Copyright © 2010 ACM
Publisher
Association for Computing Machinery
New York, NY, United States