survey

Classification of Resilience Techniques Against Functional Errors at Higher Abstraction Layers of Digital Systems

Authors:
Georgia Psychou (EECS, RWTH Aachen, Aachen, Germany; ORCID 0000-0002-8341-8343)
Dimitrios Rodopoulos (IMEC, Leuven, Belgium)
Mohamed M. Sabry (ESL, EPFL, Lausanne, Switzerland)
Tobias Gemmeke (IDS, RWTH Aachen; formerly Holst Center/IMEC, Aachen, Germany)
David Atienza (ESL, EPFL, Lausanne, Switzerland)
Tobias G. Noll (EECS, RWTH Aachen, Aachen, Germany)
Francky Catthoor (IMEC & KU Leuven, Leuven, Belgium)

ACM Computing Surveys, Volume 50, Issue 4, Article No. 50, pp. 1–38. https://doi.org/10.1145/3092699

Published: 04 October 2017

Abstract

Nanoscale technology nodes bring reliability concerns back to the center stage of digital system design. This survey presents a systematic classification of approaches that increase system resilience in the presence of functional hardware (HW)-induced errors, focusing on the higher system abstractions: the (micro)architecture, the mapping, and the platform software (SW). The field is organized into nonoverlapping categories, which add insight into ongoing work by exposing similarities and differences between techniques. HW and SW solutions are discussed in the same fashion so that their interrelationships become apparent. Representative examples from the literature illustrate the properties of each category. Moreover, it is demonstrated how hybrid schemes can be decomposed into their primitive components.

Index Terms

1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks

Published in
ACM Computing Surveys Volume 50, Issue 4
July 2018
531 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/3135069
Editor:
Sartaj Sahni
Department of Computer and Information Science and Engineering / University of Florida / Gainesville, FL
Copyright © 2017 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 4 October 2017
- Accepted: 1 April 2017
- Revised: 1 December 2016
- Received: 1 May 2016
Author Tags
Resilience
fault tolerance
mitigation
reliability
Qualifiers
- survey
- Research
- Refereed