Fault Tolerance Techniques for High-Performance Computing

Fault-Tolerance Techniques for High-Performance Computing

Part of the book series: Computer Communications and Networks ((CCN))

Abstract

This chapter provides an introduction to resilience methods. The emphasis is on checkpointing, the de facto standard technique for resilience in High Performance Computing. We present the two main protocols, namely coordinated checkpointing and hierarchical checkpointing. Then we introduce performance models and use them to assess the performance of these protocols. We cover the Young/Daly formula for the optimal checkpointing period, among other results. Next we explain how the efficiency of checkpointing can be improved via fault prediction or replication. Then we move to application-specific methods, such as algorithm-based fault tolerance (ABFT). We conclude the chapter by discussing techniques to cope with silent errors (or silent data corruption).

Notes

  1. See Sect. 1.3.2.1 for a detailed explanation on how these values (9 h or 53 min) are computed.

  2. See https://code.google.com/p/cryopid/.

  3. As a side note, one needs only 23 persons for the probability of a common birthday to reach 0.5 (a question often asked in geek evenings).
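
The figure of 23 persons quoted in note 3 can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes 365 equally likely birthdays (no leap years), and the function names `shared_birthday_probability` and `smallest_group` are hypothetical, not taken from the chapter.

```python
def shared_birthday_probability(n: int, days: int = 365) -> float:
    """Probability that at least two of n people share a birthday.

    Assumption: birthdays are independent and uniform over `days` days.
    """
    p_all_distinct = 1.0
    for k in range(n):
        # The (k+1)-th person must avoid the k birthdays already taken.
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct


def smallest_group(threshold: float = 0.5, days: int = 365) -> int:
    """Smallest group size whose collision probability reaches `threshold`."""
    n = 0
    while shared_birthday_probability(n, days) < threshold:
        n += 1
    return n


if __name__ == "__main__":
    print(smallest_group())
    print(shared_birthday_probability(22), shared_birthday_probability(23))
```

Under these assumptions the probability is still below 0.5 for 22 persons and first exceeds it at 23.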

Acknowledgments

Yves Robert is with the Institut Universitaire de France. The research presented in this chapter was supported in part by the French ANR (Rescue project) and by contracts with the DOE through the SUPER-SCIDAC project, and the CREST project of the Japan Science and Technology Agency (JST). This chapter has borrowed material from publications co-authored with many colleagues and PhD students, and the authors would like to thank Guillaume Aupy, Anne Benoit, George Bosilca, Aurélien Bouteiller, Aurélien Cavelan, Franck Cappello, Henri Casanova, Amina Guermouche, Saurabh K. Raina, Hongyang Sun, Frédéric Vivien, and Dounia Zaidouni.

Author information

Corresponding author

Correspondence to Thomas Herault.

Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Dongarra, J., Herault, T., Robert, Y. (2015). Fault Tolerance Techniques for High-Performance Computing. In: Herault, T., Robert, Y. (eds) Fault-Tolerance Techniques for High-Performance Computing. Computer Communications and Networks. Springer, Cham. https://doi.org/10.1007/978-3-319-20943-2_1

  • DOI: https://doi.org/10.1007/978-3-319-20943-2_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-20942-5

  • Online ISBN: 978-3-319-20943-2

  • eBook Packages: Computer Science, Computer Science (R0)
