ABSTRACT
The massive growth in the volume of data and the demand for big data utilisation have led to an increasing prevalence of Hadoop Distributed File System (HDFS) solutions. However, the performance of Hadoop, and of HDFS in particular, has limitations and remains an open problem in the research community. The ultimate goal of our research is to develop an adaptive replication system. This paper presents the first phase of that work: an investigation into the replication factor used in HDFS, to determine whether increasing the replication factor for in-demand data can improve the performance of the system. We constructed a physical Hadoop cluster as our experimental environment and validated our proposal using TestDFSIO and, with Hive, both a real-world data set (NOAA) and a synthetic data set (TPC-H). Results show that increasing the replication factor of the "hot" data increases the availability and locality of the data and thus decreases the job execution time.
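The link between replication factor and data locality can be illustrated with a minimal sketch. It is not the paper's method or measurement: it assumes a simplified model in which a block's replicas sit on distinct nodes and a task is offered a set of free nodes chosen uniformly at random, and the function name and parameters are hypothetical. Under those assumptions, the chance that at least one free node holds a local replica grows with the replication factor.

```python
from math import comb


def local_replica_probability(nodes: int, replicas: int, free_nodes: int) -> float:
    """Probability that at least one of `free_nodes` uniformly chosen nodes
    holds a replica of a block stored on `replicas` distinct nodes.

    Simplified illustrative model (an assumption, not the HDFS scheduler).
    """
    if nodes <= 0 or not (0 < replicas <= nodes) or not (0 <= free_nodes <= nodes):
        raise ValueError("require 0 < replicas <= nodes and 0 <= free_nodes <= nodes")
    # Ways to choose the free nodes so that every one of them misses the
    # replicas, divided by all ways to choose the free nodes.
    miss = comb(nodes - replicas, free_nodes) / comb(nodes, free_nodes)
    return 1.0 - miss


if __name__ == "__main__":
    # Hypothetical 10-node cluster with 2 free nodes at scheduling time.
    for r in (1, 3, 5):
        print(r, round(local_replica_probability(10, r, 2), 3))
```

In this toy model the gain from each extra replica shrinks as the replication factor rises, while the storage cost grows linearly, which is one reason to raise the factor only for in-demand data.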
Index Terms
- Investigation of Replication Factor for Performance Enhancement in the Hadoop Distributed File System