ABSTRACT
Modern data analytics applications, such as Internet-scale indexing, system trace analysis, and recommender engines, operate on massive amounts of data and call for a parallel approach to data processing. In this work, we focus on the popular MapReduce framework for carrying out such tasks and identify bulk data insertion as a critical preliminary step toward reducing processing times, especially when new data is generated and processed at regular intervals.
We present a parallel approach to bulk data insertion in a system that uses horizontally range-partitioned data, and we evaluate several variants of the insertion operation, including legacy approaches. Our method exploits the parallel processing framework itself to insert data, which is stored in a semi-structured format, into the system. Our results indicate that a parallel approach to bulk insertion can substantially reduce the recurrent cost of inserting new data into the system.
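The general idea of bulk insertion into range-partitioned data can be illustrated with a minimal Python sketch. It is not the implementation evaluated in this work: a thread pool stands in for the MapReduce tasks, each key range is modelled as a plain in-memory dictionary, and the names `partition_by_range`, `load_partition`, and `parallel_bulk_insert` are hypothetical.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor


def partition_by_range(records, split_keys):
    """Group (key, value) records by the key range they fall into.

    split_keys must be sorted. Range i holds the keys k with
    split_keys[i-1] <= k < split_keys[i]; the first and last
    ranges are open-ended.
    """
    buckets = [[] for _ in range(len(split_keys) + 1)]
    for key, value in records:
        buckets[bisect_right(split_keys, key)].append((key, value))
    return buckets


def load_partition(bucket):
    """Sort one bucket and build that range's store in a single pass.

    A dict is an assumed stand-in for a partition's on-disk store.
    """
    return dict(sorted(bucket))


def parallel_bulk_insert(records, split_keys, workers=4):
    """Partition the batch by key range, then load every range in parallel."""
    buckets = partition_by_range(records, split_keys)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves bucket order, so stores[i] is range i.
        return list(pool.map(load_partition, buckets))


if __name__ == "__main__":
    batch = [("m", 1), ("a", 2), ("z", 3), ("g", 4)]
    for i, store in enumerate(parallel_bulk_insert(batch, ["g", "n"])):
        print(i, store)
```

Because each range receives only the keys that belong to it, the loaders never contend for the same partition, which is the property that lets the insertion work be spread across workers.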
Index Terms
- Parallel bulk insertion for large-scale analytics applications