Investigation of Replication Factor for Performance Enhancement in the Hadoop Distributed File System

Authors:
Hilmi Egemen Ciritoglu, University College Dublin, Dublin, Ireland
Leandro Batista de Almeida, University College Dublin & UTFPR, Dublin, Ireland
Eduardo Cunha de Almeida, Universidade Federal do Paraná, Curitiba, Brazil
Teodora Sandra Buda, IBM Ireland, Dublin, Ireland
John Murphy, University College Dublin, Dublin, Ireland
Christina Thorpe, University College Dublin, Dublin, Ireland

ICPE '18: Companion of the 2018 ACM/SPEC International Conference on Performance Engineering
April 2018
Pages 135–140
https://doi.org/10.1145/3185768.3186359

Published: 02 April 2018

ABSTRACT

The massive growth in data volume and the demand for big data utilisation have led to an increasing prevalence of Hadoop Distributed File System (HDFS) solutions. However, the performance of Hadoop, and of HDFS in particular, has limitations and remains an open problem in the research community. The ultimate goal of our research is to develop an adaptive replication system; this paper presents the first phase of that work: an investigation into the replication factor used in HDFS, to determine whether increasing the replication factor for in-demand data can improve system performance. We constructed a physical Hadoop cluster as our experimental environment and validated our proposal using TestDFSIO and, with Hive, both a real-world data set (NOAA) and a synthetic one (TPC-H). Results show that increasing the replication factor of "hot" data increases the availability and locality of the data and thus decreases job execution time.
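The link between replication factor and data locality can be illustrated with a toy probability model. The sketch below is not the authors' method or measurement setup. It assumes that the replicas of a block sit on distinct nodes chosen uniformly at random (real HDFS placement is rack-aware) and that a task may run on any of a few randomly chosen free nodes. The function name and parameters are hypothetical.

```python
from math import comb


def local_read_probability(num_nodes: int, replication: int, free_nodes: int = 1) -> float:
    """Chance that at least one free node already stores a replica of a block.

    Toy assumptions (not taken from the paper): replicas sit on `replication`
    distinct nodes chosen uniformly at random, and the `free_nodes` candidate
    nodes are also chosen uniformly at random from the cluster.
    """
    if not 1 <= replication <= num_nodes:
        raise ValueError("replication must be between 1 and num_nodes")
    if not 0 <= free_nodes <= num_nodes:
        raise ValueError("free_nodes must be between 0 and num_nodes")
    # P(no free node holds a replica) = C(n - r, k) / C(n, k).
    # math.comb returns 0 when k > n - r, so the miss probability is then 0.
    miss = comb(num_nodes - replication, free_nodes) / comb(num_nodes, free_nodes)
    return 1.0 - miss
```

Under these assumptions the probability of a node-local read grows with the replication factor, which matches the direction of the effect reported in the abstract. On a real cluster the replication factor of existing files is changed with `hdfs dfs -setrep`, and the actual gain depends on the scheduler and the placement policy.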


Published in
ICPE '18: Companion of the 2018 ACM/SPEC International Conference on Performance Engineering
April 2018
212 pages
ISBN: 9781450356299
DOI: 10.1145/3185768
General Chairs:
Katinka Wolter, Free University of Berlin, Germany
Will Knottenbelt, Imperial College London, UK
Program Chairs:
André van Hoorn, University of Stuttgart, Germany
Manoj Nambiar, Tata Consultancy Services, India
Heiko Koziolek, ABB, Germany
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Hadoop distributed file system
performance testing
replication factor
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 252 of 851 submissions, 30%