A taxonomy of software-based and hardware-based approaches for energy efficiency management in the Hadoop

https://doi.org/10.1016/j.jnca.2018.11.007

Abstract

The Apache Hadoop framework supports the storage and processing of big datasets using simple programming models. Energy management has been recognized as one of the major issues in Hadoop, and many studies have been conducted in this area. However, despite the importance of this issue, there is no comprehensive study of energy efficiency in Hadoop. In this paper, the energy efficiency techniques in Hadoop are classified into two main categories: software-based and hardware-based. The benefits and drawbacks of these methods are examined, and a systematic study of the conducted research is provided. A further aim is to describe open issues and offer recommendations for future research.

Introduction

Nowadays, the ability to analyze big data repositories remains a problem for many modern enterprises and research communities (Gonçalves et al., 2017; Khezr and Navimipour, 2017). Every day, large amounts of data are produced from numerous sources, e.g., sensors, digital pictures, videos, purchase transaction records, and cell phones, but mining suitable information from these massive data repositories to make appropriate decisions is almost impractical for traditional database management system (DBMS) technologies (Cuzzocrea et al., 2011).

Hadoop, as an applicable solution to big data (Rashmi and Basu, 2017), provides reliable, fault-tolerant, scalable, and efficient services for processing large amounts of data using MapReduce (Uzunkaya et al., 2015; Zhao, 2017). Its main features are a simple programming interface, high scalability, and the capability of processing large amounts of data in distributed processing environments (Khezr and Navimipour, 2017). MapReduce plays an important role in running a very large number of data-intensive applications (Cassales et al., 2015; Chelliah, 2017).
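To make the "simple programming interface" of MapReduce concrete, the following is a minimal single-process sketch of the map, shuffle, and reduce steps for a word count. It is an illustration in plain Python, not the Hadoop API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group intermediate values by key, as a framework would between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values that share a key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence inside one process."""
    intermediate = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big clusters"]))
```

In a real cluster the map and reduce functions run in parallel on many nodes and the shuffle moves data across the network; the sketch keeps only the programming model.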

Recently, energy efficiency has become a significant issue in data centers (Khan et al., 2016; Kurpicz et al., 2018). According to a U.S. Department of Energy report, U.S. data centers consumed about 70 billion kWh in 2014 (1.8% of total U.S. electricity consumption), and it is estimated that they will consume about 73 billion kWh in 2020 (Shehabi et al., 2016). Given the environmental challenges, limited energy resources, and high energy costs (Akhter and Othman, 2016; Babar et al., 2017), hardware and software techniques should be used to reduce energy consumption. As a result, energy reduction is a major challenge for Hadoop, which runs on large clusters (Usama et al., 2017).

This paper is a systematic review of energy efficiency techniques for Hadoop. It discusses software methods, such as scheduling, and hardware methods, such as Dynamic Voltage and Frequency Scaling (DVFS), that are employed to reduce energy consumption. The main goal of this paper is to present the conceptual aspects of energy efficiency in Hadoop. The contributions of this study are listed below:

  • Providing a review of the current energy-aware methods for Hadoop.

  • Dividing energy efficiency techniques into two main classes: software-based and hardware-based techniques.

  • Providing the benefits and drawbacks of the existing energy efficiency techniques for Hadoop.

  • Discussing and comparing the main challenges for energy efficiency in Hadoop.

  • Highlighting guidelines for future research and open issues in energy efficiency in Hadoop.

Furthermore, the techniques are compared in this paper using several performance measures and Quality of Service (QoS) parameters (Conejero et al., 2016), such as data locality, fault tolerance, heterogeneity, scalability, makespan, performance, cost, and load balancing. Each of them is briefly described below.

  • Data locality: It means moving computation close to the data rather than moving data towards the computation (George et al., 2016).

  • Heterogeneity: In a heterogeneous data center, nodes have dissimilar capabilities, such as computing power (Rasooli and Down, 2014b).

  • Fault tolerance: It is the continued correct operation of a system despite the failure of one or more of its components (Sampaio and Barbosa, 2018).

  • Scalability: It is the capacity of a Hadoop cluster to be adapted and resized as conditions change (Zhang et al., 2018).

  • Makespan: It is the time difference between the beginning and the end of a job or task sequence (Kalra and Singh, 2015).

  • Load balancing: It improves the distribution of workloads across multiple computing resources (Gao and Yu, 2017; Ghomi et al., 2017).

  • Cost: Two types of cost can be considered, one in terms of manpower and the other in terms of money (Majeed and Shah, 2015).

  • Performance: The amount of useful work accomplished relative to the time and resources used (Cheng et al., 2017a).
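Three of the measures above (makespan, data locality, and load balancing) can be computed directly from a task trace. The following is a minimal sketch under stated assumptions: the `Task` tuple layout and the function names are hypothetical, and the per-node busy time is only one simple proxy for load balance, not a metric prescribed by the surveyed papers.

```python
from typing import Dict, List, Tuple

# Hypothetical trace record: (node, start_time_s, end_time_s, input_was_local).
Task = Tuple[str, float, float, bool]


def makespan(tasks: List[Task]) -> float:
    """Time between the earliest task start and the latest task end."""
    earliest_start = min(task[1] for task in tasks)
    latest_end = max(task[2] for task in tasks)
    return latest_end - earliest_start


def data_locality_ratio(tasks: List[Task]) -> float:
    """Fraction of tasks that ran on a node already holding their input."""
    local_tasks = sum(1 for task in tasks if task[3])
    return local_tasks / len(tasks)


def busy_time_per_node(tasks: List[Task]) -> Dict[str, float]:
    """Total busy time per node; a narrow spread suggests a balanced load."""
    busy: Dict[str, float] = {}
    for node, start, end, _ in tasks:
        busy[node] = busy.get(node, 0.0) + (end - start)
    return busy


if __name__ == "__main__":
    trace: List[Task] = [
        ("n1", 0.0, 4.0, True),
        ("n2", 1.0, 6.0, False),
        ("n1", 4.0, 9.0, True),
        ("n2", 6.0, 8.0, True),
    ]
    print(makespan(trace), data_locality_ratio(trace), busy_time_per_node(trace))
```

The trace values are invented for illustration only; a real evaluation would take them from job history logs or a simulator.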

The rest of this article is organized as follows: Hadoop and its components are presented in Section 2. Section 3 reviews the related work. Section 4 provides the research selection process and a Systematic Literature Review (SLR). Section 5 systematically reviews the energy efficiency approaches in Hadoop, classifies them, and compares the methods of the selected articles. Section 6 discusses the findings. Some open issues are elaborated in Section 7. Finally, Section 8 presents the conclusion along with the limitations of the paper.

Background

Apache Hadoop implements Google's MapReduce and Google File System (GFS) models (Cassales et al., 2016; Li et al., 2017; Park et al., 2016; Qin et al., 2017; Veiga et al., 2018) and supports the storage and processing of big datasets. It has attracted the attention of both industry and academia because it is an open-source solution (Polato et al., 2014). The Hadoop framework is classified as follows:

Motivation and related work

Some related works on Hadoop, MapReduce, and energy issues are discussed briefly in this section.

Majeed and Shah (2015) presented a survey of state-of-the-art techniques and architectures for energy efficiency in big data during 2007–2015. First, they considered the existing surveys on energy consumption. Then, they categorized the research papers in terms of hardware-based, component-based and the best energy efficiency methods that

Research methodology

An SLR is presented in this section to improve the understanding of energy efficiency techniques in Hadoop. An SLR is a critical assessment that analyzes all research addressing a specific issue (Navimipour and Charband, 2016; Soltani and Navimipour, 2016). The article selection process and the article classification, the two parts of the search process, are discussed in the next subsections.

Energy efficiency techniques in Hadoop

This section describes the differences, advantages, and disadvantages of the main state-of-the-art energy efficiency mechanisms in Hadoop. We review software-based and hardware-based articles on reducing energy consumption in Hadoop. These articles apply software techniques, hardware techniques, or both.
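As an illustration of why a hardware technique such as DVFS, named in the introduction, can reduce energy, the sketch below uses the standard CMOS dynamic power approximation P = C·V²·f. This model and every numeric value in it are assumptions made for this example; they are not taken from the surveyed articles, and the names `dynamic_power` and `task_energy` are hypothetical.

```python
def dynamic_power(c_eff: float, voltage: float, freq_hz: float) -> float:
    """Dynamic CMOS power in watts, using the approximation P = C_eff * V^2 * f."""
    return c_eff * voltage ** 2 * freq_hz


def task_energy(cycles: float, c_eff: float, voltage: float,
                freq_hz: float, static_power_w: float = 0.0) -> float:
    """Energy in joules to execute `cycles` clock cycles at one operating point.

    Runtime grows as frequency drops, so static power is paid for longer.
    """
    runtime_s = cycles / freq_hz
    total_power_w = dynamic_power(c_eff, voltage, freq_hz) + static_power_w
    return total_power_w * runtime_s


if __name__ == "__main__":
    cycles = 2e9   # assumed amount of work in clock cycles
    c_eff = 1e-9   # assumed effective switched capacitance in farads
    for static_w in (0.0, 1.5):
        high = task_energy(cycles, c_eff, voltage=1.2, freq_hz=2e9,
                           static_power_w=static_w)
        low = task_energy(cycles, c_eff, voltage=0.9, freq_hz=1e9,
                          static_power_w=static_w)
        print(f"static={static_w} W: high={high:.2f} J, low={low:.2f} J")
```

With these assumed numbers, the lower voltage and frequency point uses less energy when static power is ignored, at the price of doubling the runtime. Once a sizeable static power is added, the longer runtime can outweigh the dynamic savings, which is one reason frequency selection is treated as an optimization problem.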

Discussion

In the previous sections, we discussed the energy efficiency techniques of Hadoop in two main groups: software-based and hardware-based techniques. Here, we present a statistical analysis of these techniques. Table 4 and Table 5 show the main properties of the discussed methods, such as the kind of Hadoop environment and the implementation or simulation platform, for software-based and hardware-based techniques, respectively. Also, Fig. 8

Open challenges and future work

Future work should address several important challenges, which are discussed and investigated in this section. The rest of this section provides some important directions for future research.

  • Heterogeneity, the main cause of performance variability, is present in both hardware and workload characteristics. Performance and energy consumption vary when the same task is performed on different nodes. Some factors such as type of workload and the rate of Hadoop's tasks can

Conclusion and limitation

This paper systematically surveyed past and present mechanisms for energy efficiency in Hadoop. First, we gave an overview of Hadoop and its components. Then, we explained the research methodology and classified 22 selected articles into two groups: 11 software-based approaches and 11 hardware-based approaches. The important methods of each category and their advantages and disadvantages were also discussed. The reason behind addressing the

Fatemeh Shabestari received her B.S. in computer engineering, software, from Shabestar Branch, Islamic Azad University, Shabestar, Iran, in 2005 and her M.S. in computer engineering, software, from Shabestar Branch, Islamic Azad University, Shabestar, Iran, in 2009. She is currently a Ph.D. candidate in computer engineering at Science and Research Branch, Islamic Azad University, Tehran, Iran. Her research interests include big data and green computing.

References (112)

  • Y.-C. Kao et al.

    Data-locality-aware mapreduce real-time scheduling framework

    J. Syst. Software

    (2016)
  • M. Kurpicz et al.

    Energy-proportional profiling and accounting in heterogeneous virtualized environments

    Sustain. Comput. Info. Syst.

    (2018)
  • P. Leimich et al.

    A RAM triage methodology for Hadoop HDFS forensics

    Digit. Invest.

    (2016)
  • Z. Lu et al.

    InSTechAH: cost-effectively autoscaling smart computing hadoop cluster in private cloud

    J. Syst. Architect.

    (2017)
  • I. Mavridis et al.

    Performance evaluation of cloud-based log file analysis with Apache Hadoop and Apache Spark

    J. Syst. Software

    (2017)
  • K. Neshatpour et al.

    Energy-efficient acceleration of MapReduce applications using FPGAs

    J. Parallel Distr. Comput.

    (2018)
  • P.P. Nghiem et al.

    Towards efficient resource provisioning in MapReduce

    J. Parallel Distr. Comput.

    (2016)
  • A. Oussous et al.

    Big Data technologies: A survey

    J. King Saud Univ. Comput. Info. Sci.

    (2018)
  • I. Polato et al.

    A comprehensive view of Hadoop research—a systematic literature review

    J. Netw. Comput. Appl.

    (2014)
  • A. Rasooli et al.

    COSHH: a classification and optimization based scheduler for heterogeneous Hadoop systems

    Future Generat. Comput. Syst.

    (2014)
  • A. Reuther et al.

    Scalable system scheduling for HPC and big data

    J. Parallel Distr. Comput.

    (2018)
  • N.B. Rizvandi et al.

    Some observations on optimal frequency selection in DVFS-based energy consumption minimization

    J. Parallel Distr. Comput.

    (2011)
  • A.M. Sampaio et al.

    A comparative cost analysis of fault-tolerance mechanisms for availability on the cloud

    Sustain. Comput. Info. Syst.

    (2018)
  • Y. Shao et al.

    Efficient jobs scheduling approach for big data applications

    Comput. Ind. Eng.

    (2018)
  • S. Singh et al.

    Performance optimization of MapReduce-based Apriori algorithm on Hadoop cluster

    Comput. Elect. Eng.

    (2018)
  • Z. Soltani et al.

    Customer relationship management mechanisms: a systematic review of the state of the art literature and recommendations for future research

    Comput. Hum. Behav.

    (2016)
  • J. Song et al.

    Modulo based data placement algorithm for energy consumption optimization of MapReduce system

    J. Grid Comput.

    (2016)
  • M. Soualhia et al.

    Task scheduling in big data platforms: a systematic literature review

    J. Syst. Software

    (2017)
  • A. Spivak et al.

    Storage tier-aware replicative data reorganization with prioritization for efficient workload processing

    Future Generat. Comput. Syst.

    (2018)
  • M. Usama et al.

    Job schedulers for Big data processing in Hadoop environment: testing real-life schedulers using benchmark programs

    Digital Commun. Netw.

    (2017)
  • C. Uzunkaya et al.

    Hadoop ecosystem and its analysis on tweets

    Procedia Soc. Behav. Sci.

    (2015)
  • M. Varga et al.

    Deadline scheduling algorithm for sustainable computing in Hadoop environment

    Comput. Secur.

    (2018)
  • J. Veiga et al.

    BDEv 3.0: energy efficiency and microarchitectural characterization of Big Data processing frameworks

    Future Generat. Comput. Syst.

    (2018)
  • Y.-F. Wen

    Energy-aware dynamical hosts and tasks assignment for cloud computing

    J. Syst. Software

    (2016)
  • N. Akhter et al.

    Energy aware resource allocation of cloud data center: review and open issues

    Cluster Comput.

    (2016)
  • S.R. Alapati

    Expert Hadoop Administration: Managing, Tuning, and Securing Spark, YARN, and HDFS

    (2016)
  • A. Alhamali et al.

    FPGA-accelerated hadoop cluster for deep learning computations

  • B. Antony et al.

    Professional Hadoop

    (2016)
  • Apache Hadoop

    (2018)
  • J.A. Aroca et al.

    A measurement-based characterization of the energy consumption in data center servers

    IEEE J. Sel. Area. Commun.

    (2015)
  • H. Artail et al.

    Speedy Cloud: Cloud Computing with Support for Hardware Acceleration Services

    IEEE Trans. Cloud Comput.

    (2017)
  • P. Azad et al.

    An energy-aware task scheduling in the cloud computing using a hybrid cultural and ant colony optimization algorithm

    Int. J. Cloud Appl. Comput. (IJCAC)

    (2017)
  • M. Babar et al.

    Energy-harvesting based on internet of things and big data analytics for smart health monitoring

    Sustain. Comput. Info. Syst.

    (2017)
  • M. Bakratsas et al.

    Hadoop MapReduce performance on SSDs: the case of complex network analysis tasks

  • X. Cai et al.

    SLA-aware energy-efficient scheduling scheme for Hadoop YARN

    J. Supercomput.

    (2017)
  • G.W. Cassales et al.

    Improving the performance of Apache Hadoop on pervasive environments through context-aware scheduling

    J. Ambient Intell. Hum. Comput.

    (2016)
  • Y. Charband et al.

    Online knowledge sharing mechanisms: a systematic review of the state of the art literature and recommendations for future research

    Inf. Syst. Front

    (2016)
  • P.R. Chelliah

    The hadoop ecosystem technologies and tools

  • D. Cheng et al.

    Improving performance of heterogeneous mapreduce clusters with adaptive task tuning

    IEEE Trans. Parallel Distr. Syst.

    (2017)
  • D. Cheng et al.

    Energy efficiency aware task assignment with DVFS in heterogeneous hadoop clusters

Amir Masoud Rahmani received his B.S. in Computer Engineering from Amir Kabir University, Tehran, in 1996, his M.S. in Computer Engineering from Sharif University of Technology, Tehran, in 1998, and his Ph.D. in Computer Engineering from IAU University, Tehran, in 2005. Currently, he is a Professor in the Department of Computer Engineering at the IAU University. He is the author/co-author of more than 150 publications in technical journals and conferences. His research interests are in the areas of distributed systems, ad hoc and wireless sensor networks, and evolutionary computing.

Nima Jafari Navimipour received his B.S. in computer engineering, software engineering, from Tabriz Branch, Islamic Azad University, Tabriz, Iran, in 2007; his M.S. in computer engineering, computer architecture, from Tabriz Branch, Islamic Azad University, Tabriz, Iran, in 2009; and his Ph.D. in computer engineering, computer architecture, from Science and Research Branch, Islamic Azad University, Tehran, Iran, in 2014. He is an assistant professor in the Department of Computer Engineering at Tabriz Branch, Islamic Azad University, Tabriz, Iran. He has published more than 100 papers in various journals and conference proceedings. His research interests include Cloud Computing, Social Networks, Fault-Tolerance Software, QCA, Internet of Things, and Network on Chip.

Sam Jabbehdari has been working as an associate professor in the Department of Computer Engineering at IAU (Islamic Azad University), North Tehran Branch, in Tehran, since 1993. He received his B.Sc. and M.S. degrees in Electrical Engineering (Telecommunication) from Khajeh Nasir Toosi University of Technology and IAU, South Tehran Branch, in Tehran, Iran, respectively. He received his Ph.D. degree in Computer Engineering from IAU, Science and Research Branch, Tehran, Iran, in 2005. His current research interests are Scheduling, QoS, MANETs, Wireless Sensor Networks, and Cloud Computing.