Abstract
Cloud computing systems are prone to failures because of their unstable nature. In addition, the size of a cloud computing system varies over time, so failures become common events. Failures strongly affect cloud performance and the benefits expected by both customers and providers. Fault tolerance is therefore an essential challenge for cloud providers, who must mitigate the effects of failures and keep the Service Level Agreement (SLA) satisfied. Checkpointing is one of the best-known reactive fault tolerance techniques used in distributed computing. However, it can incur considerable overheads that depend on the checkpoint interval applied, and these overheads degrade the performance of the cloud. In this paper, a reactive fault tolerance approach based on checkpointing is proposed and evaluated with the aim of improving performance. The approach applies a flexible checkpoint interval to reduce overheads. Simulation experiments indicate superior performance of the approach in terms of power consumption, response time, monetary cost and cloud capacity.
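To illustrate why the checkpoint interval matters, the following minimal sketch uses Young's classical first-order approximation, which is a standard textbook model and not the approach proposed in this paper. It assumes a fixed checkpoint cost and a known mean time between failures (MTBF). The function names (overhead_fraction, young_interval, flexible_interval) and the idea of re-estimating the MTBF from observed failure gaps are hypothetical and introduced only for illustration.

```python
import math
from statistics import mean


def overhead_fraction(interval, checkpoint_cost, mtbf):
    """First-order estimate of time lost per unit of useful work.

    checkpoint_cost / interval : time spent writing checkpoints
    interval / (2 * mtbf)      : expected rework after a failure, assuming
                                 a failure lands mid-interval on average
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


def young_interval(checkpoint_cost, mtbf):
    """Interval that minimises overhead_fraction (Young's approximation)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def flexible_interval(failure_gaps, checkpoint_cost):
    """Hypothetical 'flexible' policy: re-estimate the MTBF from the
    observed gaps between failures and recompute the interval."""
    return young_interval(checkpoint_cost, mean(failure_gaps))
```

Under this model, checkpointing too often wastes time on writing checkpoints, while checkpointing too rarely increases the work lost per failure. An interval that adapts to the observed failure rate is one simple way to balance the two terms.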
Acknowledgements
This work was supported by King Saud University, Deanship of Scientific Research, Community College Research Unit.
Cite this article
Amoon, M., El-Bahnasawy, N., Sadi, S. et al. On the design of reactive approach with flexible checkpoint interval to tolerate faults in cloud computing systems. J Ambient Intell Human Comput 10, 4567–4577 (2019). https://doi.org/10.1007/s12652-018-1139-y
DOI: https://doi.org/10.1007/s12652-018-1139-y