DOI: 10.1145/3185768.3186359
research-article

Investigation of Replication Factor for Performance Enhancement in the Hadoop Distributed File System

Published: 02 April 2018

ABSTRACT

The massive growth in the volume of data and the demand for big data utilisation have led to an increasing prevalence of Hadoop Distributed File System (HDFS) solutions. However, the performance of Hadoop, and indeed of HDFS, has limitations and remains an open problem in the research community. The ultimate goal of our research is to develop an adaptive replication system; this paper presents the first phase of that work: an investigation into the replication factor used in HDFS to determine whether increasing the replication factor for in-demand data can improve the performance of the system. We constructed a physical Hadoop cluster as our experimental environment and validated our proposal using TestDFSIO and, with Hive, both a real-world data set (NOAA) and a synthetic one (TPC-H). Results show that increasing the replication factor of the "hot" data increases the availability and locality of the data and thus decreases the job execution time.
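The link between replication factor and data locality can be illustrated with a small back-of-the-envelope model. This sketch is not the method used in the paper, which measured a physical cluster. It assumes that a block's replicas sit on distinct nodes chosen uniformly at random and that the nodes with free task slots are also chosen uniformly at random. The function name `locality_probability` and its parameters are hypothetical names introduced only for this illustration.

```python
from math import comb


def locality_probability(nodes: int, replicas: int, free_nodes: int) -> float:
    """Chance that at least one node with a free slot also stores a replica.

    Toy model (an assumption, not taken from the paper): replicas occupy
    `replicas` distinct nodes out of `nodes`, and `free_nodes` nodes with
    free task slots are drawn uniformly at random.
    """
    if not (0 < replicas <= nodes):
        raise ValueError("replicas must be between 1 and nodes")
    if not (0 <= free_nodes <= nodes):
        raise ValueError("free_nodes must be between 0 and nodes")
    # Probability that every free node is one of the nodes without a replica.
    miss = comb(nodes - replicas, free_nodes) / comb(nodes, free_nodes)
    return 1.0 - miss


if __name__ == "__main__":
    # Compare several replication factors on a hypothetical 10-node cluster
    # in which 2 nodes currently have a free task slot.
    for r in (1, 2, 3, 5):
        print(f"replication={r} locality={locality_probability(10, r, 2):.3f}")
```

Under these assumptions the probability of a node-local read grows with the replication factor, which is consistent with the direction of the result reported above. The model ignores rack awareness, the extra storage consumed, and write cost. On a real cluster, the replication factor of existing data can be changed with the `hdfs dfs -setrep` command.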


Published in

ICPE '18: Companion of the 2018 ACM/SPEC International Conference on Performance Engineering
April 2018
212 pages
ISBN: 9781450356299
DOI: 10.1145/3185768

Copyright © 2018 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 252 of 851 submissions, 30%
