A taxonomy of software-based and hardware-based approaches for energy efficiency management in the Hadoop
Introduction
Nowadays, analyzing big data repositories remains a problem for many modern enterprises and research communities (Gonçalves et al., 2017; Khezr and Navimipour, 2017). Every day, large amounts of data are produced by numerous sources, e.g., sensors, digital pictures, videos, purchase transaction records, and cell phones, but mining suitable information from these massive repositories to support decision making is almost impractical for traditional database management system (DBMS) technologies (Cuzzocrea et al., 2011).
Hadoop, as an applicable solution to big data (Rashmi and Basu, 2017), provides reliable, fault-tolerant, scalable, and efficient services for processing large amounts of data using MapReduce (Uzunkaya et al., 2015; Zhao, 2017). Its main features are a simple programming interface, high scalability, and the capability to process large amounts of data in distributed processing environments (Khezr and Navimipour, 2017). MapReduce plays an important role in running a very large number of data-intensive applications (Cassales et al., 2015; Chelliah, 2017).
Recently, energy efficiency has become a significant issue in data centers (Khan et al., 2016; Kurpicz et al., 2018). According to a U.S. Department of Energy report, U.S. data centers consumed about 70 billion kWh in 2014 (1.8% of total U.S. electricity consumption), and about 73 billion kWh is estimated to be consumed by U.S. data centers in 2020 (Shehabi et al., 2016). Given the environmental challenges, the limited energy resources, and high energy costs (Akhter and Othman, 2016; Babar et al., 2017), hardware and software techniques should be used to reduce energy consumption. As a result, reducing energy consumption is a major challenge for Hadoop, which typically runs on large clusters (Usama et al., 2017).
This paper presents a systematic review of energy efficiency techniques for Hadoop. It discusses different software methods, such as scheduling, and hardware methods, such as Dynamic Voltage and Frequency Scaling (DVFS), that are employed to reduce energy consumption. The main goal of this paper is to present the conceptual aspects of energy efficiency in Hadoop. The contributions of this study are listed below:
- Providing a review of the current energy-aware methods for Hadoop.
- Dividing energy efficiency techniques into two main classes: software-based and hardware-based techniques.
- Providing the benefits and drawbacks of the existing energy efficiency techniques for Hadoop.
- Discussing and comparing the main challenges for energy efficiency in Hadoop.
- Highlighting guidelines for future research and open issues about energy efficiency in Hadoop.
Furthermore, the techniques are compared in this paper using some performance measures and Quality of Service (QoS) parameters (Conejero et al., 2016) such as data locality, fault tolerance, heterogeneity, scalability, makespan, performance, cost, and load balancing. We briefly define each of them below.
- Data locality: Moving computation close to the data rather than moving data towards the computation (George et al., 2016).
- Heterogeneity: In a heterogeneous data center, nodes differ in their capabilities, such as computing power (Rasooli and Down, 2014b).
- Fault tolerance: The continuous and correct operation of a system in the presence of failures of its component(s) (Sampaio and Barbosa, 2018).
- Scalability: The capacity of a Hadoop cluster to be changed and reconfigured in numerous situations (Zhang et al., 2018).
- Makespan: The time difference between the beginning and the end of a job or task sequence (Kalra and Singh, 2015).
- Load balancing: Improving the distribution of load across multiple computing resources (Gao and Yu, 2017; Ghomi et al., 2017).
- Cost: Two types of cost can be considered, one in terms of manpower and the other in terms of money (Majeed and Shah, 2015).
- Performance: The amount of useful work accomplished, estimated in terms of the time needed, the resources used, etc. (Cheng et al., 2017a).
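To make some of the measures above concrete, the following minimal Python sketch computes three of them (makespan, a data-locality ratio, and a simple load-imbalance indicator) from a list of finished tasks. The `Task` record, its field names, the sample values, and the max-over-mean imbalance indicator are illustrative assumptions, not definitions taken from the cited works.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical record of one finished task (all fields are assumptions)."""
    node: str         # name of the node that executed the task
    start: float      # start time in seconds
    end: float        # end time in seconds
    data_local: bool  # True if the input block was stored on the same node


def makespan(tasks):
    """Time difference between the earliest start and the latest end."""
    return max(t.end for t in tasks) - min(t.start for t in tasks)


def data_locality_ratio(tasks):
    """Fraction of tasks that ran on a node already holding their input."""
    return sum(t.data_local for t in tasks) / len(tasks)


def load_per_node(tasks):
    """Total busy time accumulated on each node."""
    load = {}
    for t in tasks:
        load[t.node] = load.get(t.node, 0.0) + (t.end - t.start)
    return load


def load_imbalance(tasks):
    """Busiest node's load divided by the mean load (1.0 means balanced)."""
    load = load_per_node(tasks)
    mean = sum(load.values()) / len(load)
    return max(load.values()) / mean


# Made-up sample: four tasks on two nodes.
SAMPLE_TASKS = [
    Task("n1", 0.0, 4.0, True),
    Task("n1", 4.0, 10.0, False),
    Task("n2", 0.0, 5.0, True),
    Task("n2", 5.0, 8.0, True),
]

if __name__ == "__main__":
    print("makespan:", makespan(SAMPLE_TASKS))
    print("data locality ratio:", data_locality_ratio(SAMPLE_TASKS))
    print("load imbalance:", load_imbalance(SAMPLE_TASKS))
```

In this sample, node `n1` is busy for 10 s and node `n2` for 8 s, so the indicator is slightly above 1.0; a scheduler that balances load better would push it towards 1.0.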
The rest of this article is organized as follows: Hadoop and its components are presented in Section 2. Section 3 reviews the related work. Section 4 provides the research selection process and a Systematic Literature Review (SLR). Section 5 systematically overviews and classifies the energy efficiency approaches in Hadoop, and compares the methods of the selected articles. Section 6 discusses the obtained findings. Some open issues are elaborated in Section 7. Finally, Section 8 presents the conclusion and the limitations of the paper.
Background
Apache Hadoop implements Google's MapReduce and Google File System (GFS) models (Cassales et al., 2016; Li et al., 2017; Park et al., 2016; Qin et al., 2017; Veiga et al., 2018) and supports the storing and processing of big datasets. It has attracted the attention of both industry and academia because it is an open-source solution (Polato et al., 2014). The Hadoop framework is classified as follows:
Motivation and related work
Some related works on Hadoop, MapReduce, and energy issues are discussed briefly in this section.
Majeed and Shah (2015) presented a state-of-the-art survey of techniques and architectures for energy efficiency in big data covering 2007–2015. First, they considered the existing surveys on energy consumption. Then, they categorized the research papers in terms of hardware-based, component-based and the best energy efficiency methods that
Research methodology
An SLR is presented in this section to improve the understanding of the energy efficiency techniques in Hadoop. An SLR is a critical assessment that analyzes all studies addressing a specific issue (Navimipour and Charband, 2016; Soltani and Navimipour, 2016). The two parts of the search process, article classification and article selection, are discussed in the next subsections.
Energy efficiency techniques in Hadoop
This section describes the differences, advantages, and disadvantages of the main state-of-the-art energy efficiency mechanisms in Hadoop. We review articles that reduce the energy consumption of Hadoop using software-based techniques, hardware-based techniques, or both.
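As a rough illustration of a hardware-based technique such as DVFS, the sketch below uses the textbook CMOS approximation in which dynamic power grows with C·V²·f, and picks the lowest-energy frequency/voltage level that still meets a deadline. The function names, the constant static power term, and all numeric values are hypothetical and are not drawn from the reviewed articles.

```python
def task_energy(cycles, freq_hz, volt, c_eff, p_static):
    """Energy in joules to run `cycles` CPU cycles at one DVFS level.

    Assumed model: dynamic power = c_eff * V^2 * f, plus a constant
    static power that is drawn for the whole runtime.
    """
    runtime_s = cycles / freq_hz
    p_dynamic = c_eff * volt ** 2 * freq_hz
    return (p_dynamic + p_static) * runtime_s


def best_level(cycles, levels, c_eff, p_static, deadline_s):
    """Return the (freq_hz, volt) pair with the least energy that meets the deadline.

    Returns None when no level is fast enough.
    """
    candidates = [
        (task_energy(cycles, f, v, c_eff, p_static), f, v)
        for f, v in levels
        if cycles / f <= deadline_s
    ]
    if not candidates:
        return None
    _, f, v = min(candidates)
    return f, v


# Hypothetical frequency (Hz) / voltage (V) pairs and workload parameters.
LEVELS = [(1.0e9, 0.9), (1.5e9, 1.0), (2.0e9, 1.2)]
CYCLES = 2.0e9     # work to execute, in CPU cycles
C_EFF = 1.0e-9     # effective switched capacitance, in farads
P_STATIC = 0.5     # static (leakage) power, in watts

if __name__ == "__main__":
    for deadline in (3.0, 1.5, 0.5):
        print(deadline, best_level(CYCLES, LEVELS, C_EFF, P_STATIC, deadline))
```

Under these assumed numbers, a relaxed deadline lets the slowest level win, a tighter deadline forces a faster and more energy-hungry level, and a deadline that no level can meet yields `None`. The static term is why simply running as slowly as possible is not always the most energy-efficient choice.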
Discussion
In the previous sections, we discussed the energy efficiency techniques of Hadoop in two main groups: software-based and hardware-based techniques. We now present a statistical analysis of the reviewed techniques. Table 4 and Table 5 show the main properties of the discussed methods, such as the kind of Hadoop environment and the platform of implementation or simulation, for software-based and hardware-based techniques, respectively. Also, Fig. 8
Open challenges and future work
Future work should consider some important challenges, which are discussed in this section together with some important directions for future research.
- Heterogeneity, the main cause of performance variability, is present in both hardware and workload characteristics. Performance and energy consumption vary when the same task is performed on different nodes. Some factors such as type of workload and the rate of Hadoop's tasks can
Conclusion and limitation
This paper systematically surveys previous and current mechanisms for energy efficiency in Hadoop. First, we have overviewed Hadoop and its components. Then, we have explained the research methodology and have classified the 22 selected articles into two groups: 11 software-based approaches and 11 hardware-based approaches. Also, important methods of each category and their advantages and disadvantages are discussed. The reason behind addressing the
Fatemeh Shabestari received her B.S. in computer engineering, software, from Shabestar Branch, Islamic Azad University, Shabestar, Iran, in 2005 and the M.S. in computer engineering, software, from Shabestar Branch, Islamic Azad University, Shabestar, Iran, in 2009. She is currently a Ph.D. candidate in computer engineering at Science and Research Branch, Islamic Azad University, Tehran, Iran. Her research interests include big data and green computing.
References (112)
- Dynamic power management techniques in multi-core architectures: a survey study. Ain Shams Eng. J. (2017)
- A taxonomy and survey of energy-efficient data centers and cloud computing systems. Adv. Comput. (2011)
- Context-aware scheduling for Apache hadoop over pervasive environments. Procedia Comput. Sci. (2015)
- Analyzing Hadoop power consumption and impact on application QoS. Future Generat. Comput. Syst. (2016)
- Resource management in cloud platform as a service systems: analysis and opportunities. J. Syst. Software (2017)
- FARMS: efficient mapreduce speculation for failure recovery in short jobs. Parallel Comput. (2017)
- Mapreduce performance model for Hadoop 2. x. Info. Syst. (2019)
- Towards of a real-time big data architecture to intensive care. Procedia Comput. Sci. (2017)
- Governing energy consumption in hadoop through cpu frequency scaling: an analysis. Future Generat. Comput. Syst. (2016)
- A review of metaheuristic scheduling techniques in cloud computing. Egypt. Inf. J. (2015)
- Data-locality-aware mapreduce real-time scheduling framework. J. Syst. Software
- Energy-proportional profiling and accounting in heterogeneous virtualized environments. Sustain. Comput. Info. Syst.
- A RAM triage methodology for Hadoop HDFS forensics. Digit. Invest.
- InSTechAH: cost-effectively autoscaling smart computing hadoop cluster in private cloud. J. Syst. Architect.
- Performance evaluation of cloud-based log file analysis with Apache Hadoop and Apache Spark. J. Syst. Software
- Energy-efficient acceleration of MapReduce applications using FPGAs. J. Parallel Distr. Comput.
- Towards efficient resource provisioning in MapReduce. J. Parallel Distr. Comput.
- Big Data technologies: A survey. J. King Saud Univ. Comput. Info. Sci.
- A comprehensive view of Hadoop research: a systematic literature review. J. Netw. Comput. Appl.
- COSHH: a classification and optimization based scheduler for heterogeneous Hadoop systems. Future Generat. Comput. Syst.
- Scalable system scheduling for HPC and big data. J. Parallel Distr. Comput.
- Some observations on optimal frequency selection in DVFS-based energy consumption minimization. J. Parallel Distr. Comput.
- A comparative cost analysis of fault-tolerance mechanisms for availability on the cloud. Sustain. Comput. Info. Syst.
- Efficient jobs scheduling approach for big data applications. Comput. Ind. Eng.
- Performance optimization of MapReduce-based Apriori algorithm on Hadoop cluster. Comput. Elect. Eng.
- Customer relationship management mechanisms: a systematic review of the state of the art literature and recommendations for future research. Comput. Hum. Behav.
- Modulo based data placement algorithm for energy consumption optimization of MapReduce system. J. Grid Comput.
- Task scheduling in big data platforms: a systematic literature review. J. Syst. Software
- Storage tier-aware replicative data reorganization with prioritization for efficient workload processing. Future Generat. Comput. Syst.
- Job schedulers for Big data processing in Hadoop environment: testing real-life schedulers using benchmark programs. Digital Commun. Netw.
- Hadoop ecosystem and its analysis on tweets. Procedia Soc. Behav. Sci.
- Deadline scheduling algorithm for sustainable computing in Hadoop environment. Comput. Secur.
- BDEv 3.0: energy efficiency and microarchitectural characterization of Big Data processing frameworks. Future Generat. Comput. Syst.
- Energy-aware dynamical hosts and tasks assignment for cloud computing. J. Syst. Software
- Energy aware resource allocation of cloud data center: review and open issues. Cluster Comput.
- Expert Hadoop Administration: Managing, Tuning, and Securing Spark, YARN, and HDFS
- FPGA-accelerated hadoop cluster for deep learning computations
- Professional Hadoop
- Apache Hadoop
- A measurement-based characterization of the energy consumption in data center servers. IEEE J. Sel. Area. Commun.
- Speedy Cloud: Cloud Computing with Support for Hardware Acceleration Services. IEEE Trans. Cloud Comput.
- An energy-aware task scheduling in the cloud computing using a hybrid cultural and ant colony optimization algorithm. Int. J. Cloud Appl. Comput. (IJCAC)
- Energy-harvesting based on internet of things and big data analytics for smart health monitoring. Sustain. Comput. Info. Syst.
- Hadoop MapReduce performance on SSDs: the case of complex network analysis tasks
- SLA-aware energy-efficient scheduling scheme for Hadoop YARN. J. Supercomput.
- Improving the performance of Apache Hadoop on pervasive environments through context-aware scheduling. J. Ambient Intell. Hum. Comput.
- Online knowledge sharing mechanisms: a systematic review of the state of the art literature and recommendations for future research. Inf. Syst. Front
- The hadoop ecosystem technologies and tools
- Improving performance of heterogeneous mapreduce clusters with adaptive task tuning. IEEE Trans. Parallel Distr. Syst.
- Energy efficiency aware task assignment with DVFS in heterogeneous hadoop clusters
Amir Masoud Rahmani received his B.S. in Computer Engineering from Amir Kabir University, Tehran, in 1996, the M.S. in Computer Engineering from Sharif University of Technology, Tehran, in 1998 and the Ph.D. degree in Computer Engineering from IAU University, Tehran, in 2005. Currently, he is a Professor in the Department of Computer Engineering at the IAU University. He is the author/co-author of more than 150 publications in technical journals and conferences. His research interests are in the areas of distributed systems, ad hoc and wireless sensor networks and evolutionary computing.
Nima Jafari Navimipour received his B.S. in computer engineering, software engineering, from Tabriz Branch, Islamic Azad University, Tabriz, Iran, in 2007; the M.S. in computer engineering, computer architecture, from Tabriz Branch, Islamic Azad University, Tabriz, Iran, in 2009; and the Ph.D. in computer engineering, computer architecture, from Science and Research Branch, Islamic Azad University, Tehran, Iran, in 2014. He is an assistant professor in the Department of Computer Engineering at Tabriz Branch, Islamic Azad University, Tabriz, Iran. He has published more than 100 papers in various journals and conference proceedings. His research interests include Cloud Computing, Social Networks, Fault-Tolerance Software, QCA, Internet of Things, and Network on Chip.
Sam Jabbehdari has been working as an associate professor in the Department of Computer Engineering at IAU (Islamic Azad University), North Tehran Branch, Tehran, since 1993. He received his B.Sc. and M.S. degrees in Electrical Engineering Telecommunication from Khajeh Nasir Toosi University of Technology and IAU, South Tehran Branch, Tehran, Iran, respectively. He received his Ph.D. degree in Computer Engineering from IAU, Science and Research Branch, Tehran, Iran, in 2005. His current research interests are Scheduling, QoS, MANETs, Wireless Sensor Networks and Cloud Computing.