Scheduling and checkpointing optimization algorithm for Byzantine fault tolerance in cloud clusters

Cluster Computing

Abstract

Cloud computing distinguishes itself from other distributed computing paradigms by offering services on demand, without geographical restrictions. This has changed computing by making services available to a wide array of customers, from casual users to highly business-oriented industries. In spite of these capabilities, cloud computing still struggles to handle a wide array of faults, which costs it credibility. Among those faults, Byzantine faults pose a serious challenge to fault tolerance mechanisms because they often go undetected at the initial stage and can easily propagate to other virtual machines (VMs) before they are detected. Consequently, some mission-critical applications, such as air traffic control and online banking, still stay away from the cloud. Moreover, if a Byzantine fault is not detected and tolerated at the initial stage, applications such as big data analytics can go completely wrong in spite of hours of computation performed by the entire cloud. Previous work therefore proposed a fool-proof Byzantine fault detection mechanism; as a continuation, this work designs a scheduling algorithm (WSSS) and a checkpoint optimization algorithm (TCC) to tolerate and eliminate Byzantine faults before they have any impact. The WSSS algorithm keeps track of the performance of the servers that make up the virtual clusters, so that the best-performing server can be allocated to mission-critical applications. To do so, WSSS ranks the servers based on a counter that monitors every virtual node (VN) for time and performance failures. The TCC algorithm identifies the regions likely to be prone to Byzantine errors by monitoring delay variation, and starts new VNs from the previous checkpoint. Moreover, it can stretch the state interval for well-performing, error-free VNs in an effort to minimize the space, time and cost overheads caused by checkpointing. The analysis is performed by plotting state transitions and by CloudSim-based simulation. The results show that TCC reduces fault tolerance overhead exponentially and that WSSS allots virtual resources effectively.
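The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' WSSS or TCC: the abstract gives neither the counter update rule nor the interval rule, so everything here is an assumption made for illustration. The names `ServerRecord`, `rank_servers` and `next_checkpoint_interval`, the choice to sum time and performance failures into one count, and the "double when healthy, reset when suspicious" interval rule with its `threshold`, `stretch`, `base` and `maximum` parameters are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ServerRecord:
    """Failure counters for one server in a virtual cluster (hypothetical layout)."""
    name: str
    time_failures: int = 0         # virtual nodes that missed a deadline
    performance_failures: int = 0  # virtual nodes that returned degraded results

    @property
    def failure_count(self) -> int:
        # Assumption: both failure kinds weigh equally in the ranking counter.
        return self.time_failures + self.performance_failures


def rank_servers(servers):
    """Return servers ordered best-first, i.e. fewest recorded failures first.

    Ties are broken by name so that the ordering is deterministic.
    """
    return sorted(servers, key=lambda s: (s.failure_count, s.name))


def next_checkpoint_interval(current, delay_variation, threshold,
                             stretch=2.0, base=1.0, maximum=64.0):
    """Adapt a checkpoint interval from the observed delay variation.

    Assumed rule: a node whose delay variation stays at or below `threshold`
    is treated as healthy and its interval is stretched (capped at `maximum`);
    otherwise the interval falls back to `base` so a restart loses little work.
    """
    if delay_variation > threshold:
        return base
    return min(current * stretch, maximum)
```

With three servers holding 3, 0 and 1 recorded failures, `rank_servers` places the server with no failures first, which is the one a scheduler of this kind would offer to a mission-critical application. Stretching the interval only while delay variation stays low is one simple way to take fewer checkpoints on healthy nodes; the actual policy in the paper may differ.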

Author information

Corresponding author

Correspondence to Sathya Chinnathambi.

Cite this article

Chinnathambi, S., Santhanam, A., Rajarathinam, J. et al. Scheduling and checkpointing optimization algorithm for Byzantine fault tolerance in cloud clusters. Cluster Comput 22 (Suppl 6), 14637–14650 (2019). https://doi.org/10.1007/s10586-018-2375-9

  • DOI: https://doi.org/10.1007/s10586-018-2375-9
