Fault Tolerance Techniques for High-Performance Computing

Dongarra, Jack; Herault, Thomas; Robert, Yves

doi:10.1007/978-3-319-20943-2_1

Part of the book series: Computer Communications and Networks (CCN)

Abstract

This chapter provides an introduction to resilience methods. The emphasis is on checkpointing, the de facto standard technique for resilience in High Performance Computing. We present the two main protocols, namely coordinated checkpointing and hierarchical checkpointing. Then we introduce performance models and use them to assess the performance of these protocols. We cover the Young/Daly formula for the optimal checkpointing period and related results. Next we explain how the efficiency of checkpointing can be improved via fault prediction or replication. Then we move to application-specific methods, such as algorithm-based fault tolerance (ABFT). We conclude the chapter by discussing techniques to cope with silent errors (or silent data corruption).
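As a brief illustration of the Young/Daly formula mentioned in the abstract, the sketch below computes the widely quoted first-order approximation of the optimal checkpointing period, sqrt(2 * C * mu), where C is the time to take one checkpoint and mu is the platform MTBF. The function names, the parameter names, and the assumption that the platform MTBF equals the individual node MTBF divided by the number of nodes (independent, identically distributed failures) are illustrative choices, not taken from the chapter text shown here.

```python
import math


def platform_mtbf(individual_mtbf_s: float, num_nodes: int) -> float:
    """Platform MTBF, assuming independent, identical nodes (assumption)."""
    if individual_mtbf_s <= 0 or num_nodes <= 0:
        raise ValueError("MTBF and node count must be positive")
    return individual_mtbf_s / num_nodes


def young_daly_period(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order approximation of the optimal checkpointing period.

    Returns sqrt(2 * C * mu), with C the checkpoint cost and mu the
    platform MTBF, both in seconds. The approximation is only meaningful
    when C is small compared with mu.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


if __name__ == "__main__":
    # Hypothetical platform: 1000 nodes, each with an MTBF of 10^7 seconds,
    # and a checkpoint that takes 50 seconds to write.
    mu = platform_mtbf(1e7, 1000)
    print(f"platform MTBF: {mu:.0f} s")
    print(f"checkpoint period: {young_daly_period(50.0, mu):.0f} s")
```

The numbers in the usage example are hypothetical. They are chosen only to show that the period shrinks as the platform grows, since the platform MTBF drops with the node count.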

Notes

1. See Sect. 1.3.2.1 for a detailed explanation on how these values (9 h or 53 min) are computed.
2. See https://code.google.com/p/cryopid/.
3. As a side note, one needs only 23 persons for the probability of a common birthday to reach 0.5 (a question often asked in geek evenings).
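The birthday figure in the last note can be checked with a few lines of Python. This is a minimal sketch that assumes 365 equally likely birthdays and ignores leap years; the function name is illustrative.

```python
def shared_birthday_probability(people: int, days: int = 365) -> float:
    """Probability that at least two of `people` share a birthday."""
    p_all_distinct = 1.0
    for k in range(people):
        # The (k+1)-th person must avoid the k birthdays already taken.
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    for n in (22, 23):
        print(n, round(shared_birthday_probability(n), 4))
```

With 22 persons the probability is still below 0.5, and with 23 it is just above, which matches the note.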

Acknowledgments

Yves Robert is with the Institut Universitaire de France. The research presented in this chapter was supported in part by the French ANR (Rescue project) and by contracts with the DOE through the SUPER-SCIDAC project, and the CREST project of the Japan Science and Technology Agency (JST). This chapter has borrowed material from publications co-authored with many colleagues and PhD students, and the authors would like to thank Guillaume Aupy, Anne Benoit, George Bosilca, Aurélien Bouteiller, Aurélien Cavelan, Franck Cappello, Henri Casanova, Amina Guermouche, Saurabh K. Raina, Hongyang Sun, Frédéric Vivien, and Dounia Zaidouni.

Author information

Authors and Affiliations

University of Tennessee, Knoxville, TN, USA
Jack Dongarra, Thomas Herault & Yves Robert
Oak Ridge National Laboratory, Oak Ridge, USA
Jack Dongarra
University of Manchester, Manchester, UK
Jack Dongarra
Ecole Normale Supérieure de Lyon, Lyon, France
Yves Robert

Corresponding author

Correspondence to Thomas Herault.

Editor information

Editors and Affiliations

University of Tennessee, Knoxville, Tennessee, USA
Thomas Herault
Ecole Normale Supérieure de Lyon, Lyon, France
Yves Robert

About this chapter

Cite this chapter

Dongarra, J., Herault, T., Robert, Y. (2015). Fault Tolerance Techniques for High-Performance Computing. In: Herault, T., Robert, Y. (eds) Fault-Tolerance Techniques for High-Performance Computing. Computer Communications and Networks. Springer, Cham. https://doi.org/10.1007/978-3-319-20943-2_1

DOI: https://doi.org/10.1007/978-3-319-20943-2_1
Published: 02 July 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-20942-5
Online ISBN: 978-3-319-20943-2
eBook Packages: Computer Science, Computer Science (R0)
