survey

Classification of Resilience Techniques Against Functional Errors at Higher Abstraction Layers of Digital Systems

Published: 04 October 2017

Abstract

Nanoscale technology nodes bring reliability concerns back to the center stage of digital system design. A systematic classification of approaches that increase system resilience in the presence of functional hardware (HW)-induced errors is presented, dealing with higher system abstractions, such as the (micro)architecture, the mapping, and platform software (SW). The field is surveyed in a systematic way based on nonoverlapping categories, which add insight into the ongoing work by exposing similarities and differences. HW and SW solutions are discussed in a similar fashion so that interrelationships become apparent. Representative examples from the literature illustrate the properties of each category. Moreover, it is demonstrated how hybrid schemes can be decomposed into their primitive components.

References

  1. Sarah Abdallah, Ali Chehab, Imad H. Elhajj, and Ayman Kayssi. 2012. Stochastic hardware architectures: A survey. In Proceedings of the 2012 International Conference on Energy Aware Computing. 1--6.Google ScholarGoogle ScholarCross RefCross Ref
  2. Rishi Agarwal, Pranav Garg, and Josep Torrellas. 2011. Rebound: Scalable Checkpointing for Coherent Shared Memory. Vol. 39. ACM, New York, NY. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Nidhi Aggarwal. 2008. Achieving High Availability With Commodity Hardware and Software. ProQuest.Google ScholarGoogle Scholar
  4. Nidhi Aggarwal, Parthasarathy Ranganathan, Norman P. Jouppi, and James E. Smith. 2007. Configurable isolation: Building high availability systems with commodity multi-core processors. In ACM SIGARCH Computer Architecture News 35, 470--481. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Rana Ejaz Ahmed, Robert C. Frazier, and Peter N. Marinos. 1990. Cache-aided rollback error recovery (CARER) algorithm for shared-memory multiprocessor systems. In Proceedings of the 20th International Symposium on Fault-Tolerant Computing (FTCS-20). IEEE, Los Alamitos, CA, 82--88.Google ScholarGoogle Scholar
  6. Robert Aitken, Görschwin Fey, Zbigniew T. Kalbarczyk, Frank Reichenbach, and Matteo Sonza Reorda. 2013. Reliability analysis reloaded: How will we survive? In Proceedings of the Conference on Design, Automation, and Test in Europe. 358--367. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Yiannis Andreopoulos. 2013. Error tolerant multimedia stream processing: There’s plenty of room at the top (of the system stack). IEEE Transactions on Multimedia 15, 2, 291--303. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. T. M. Austin. 1999. DIVA: A reliable substrate for deep submicron microarchitecture design. In Proceedings of the 32nd Annual International Symposium on Microarchitecture (MICRO-32). 196--207. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Algirdas Avizienis. 1985. The N-version approach to fault-tolerant software. IEEE Transactions on Software Engineering 12, 1491--1501. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Algirdas Avizienis, George C. Gilley, Francis P. Mathur, David A. Rennels, John A. Rohr, and David K. Rubin. 1971. The STAR (self-testing and repairing) computer: An investigation of the theory and practice of fault-tolerant computer design. IEEE Transactions on Computers 100, 11, 1312--1321. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Algirdas Avizienis, Jean-Claude Laprie, Brian Randell, and Carl Landwehr. 2004. Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing 1, 1, 11--33. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Rajanikanth Batchu, Yoginder S. Dandass, Anthony Skjellum, and Murali Beddhu. 2004. MPI/FT: A model-based approach to low-overhead fault tolerant message-passing middleware. Cluster Computing 7, 4, 303--315. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. A. A. Bertossi, L. V. Mancini, and F. Rossini. 1999. Fault-tolerant rate-monotonic first-fit scheduling in hard-real-time systems. IEEE Transactions on Parallel and Distributed Systems 10, 9, 934--945. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Douglas M. Blough, Fadi J. Kurdahi, and Seong Y. Ohm. 1997. Optimal algorithms for recovery point insertion in recoverable microarchitectures. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 16, 9, 945--955. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Andrea Bondavalli, Silvano Chiaradonna, Felicita Di Giandomenico, and Fabrizio Grandoni. 2000. Threshold-based mechanisms to discriminate transient from intermittent faults. IEEE Transactions on Computers 49, 3, 230--245. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. S. Borkar. 2005. Designing reliable systems from unreliable components: The challenges of transistor variability and degradation. IEEE Micro 25, 6, 10--16. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Fred A. Bower, Daniel J. Sorin, and Sule Ozev. 2005. A mechanism for online diagnosis of hard faults in microprocessors. In Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE, Los Alamitos, CA, 197--208. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Christian Brehm, Matthias May, Christina Gimmler, and Norbert Wehn. 2012. A case study on error resilient architectures for wireless communication. In Proceedings of the 25th International Conference on Architecture of Computing Systems (ARCS’12). 13--24. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Thomas C. Bressoud and Fred B. Schneider. 1996. Hypervisor-based fault tolerance. ACM Transactions on Computer Systems 14, 1, 80--107. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Nicholas P. Carter, Helia Naeimi, and Donald S. Gardner. 2010. Design techniques for cross-layer resilience. In Proceedings of the Conference on Design, Automation, and Test in Europe. 1023--1028. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Sayantan Chakravorty, Celso L. Mendes, and Laxmikant V. Kalé. 2006. Proactive fault tolerance in MPI applications via task migration. In Proceedings of the 13th International Conference on High Performance Computing (HiPC’06). 485--496. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Vishal Chandra. 2014. Monitoring reliability in embedded processors—a multi-layer view. In Proceedings of the 2014 51st ACM/EDAC/IEEE Design Automation Conference (DAC’04). IEEE, Los Alamitos, CA, 1--6. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. K. Mani Chandy and Chittoor V. Ramamoorthy. 1972. Rollback and recovery strategies for computer programs. IEEE Transactions on Computers 100, 6, 546--556. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Mengly Chean and Jose A. B. Fortes. 1990. A taxonomy of reconfiguration techniques for fault-tolerant processor arrays. Computer 23, 1, 55--69. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Yunji Chen, Shijin Zhang, Qi Guo, Ling Li, Ruiyang Wu, and Tianshi Chen. 2015. Deterministic replay: A survey. ACM Computing Surveys 48, 2, Article No. 17. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Hyungmin Cho, Larkhoon Leem, and Subhasish Mitra. 2012. ERSA: Error resilient system architecture for probabilistic applications. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 31, 4, 546--558. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Marc de Kruijf, Shuou Nomura, and Karthikeyan Sankaralingam. 2010. Relax: An architectural framework for software recovery of hardware faults. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA’10). ACM, New York, NY, 497--508. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. André DeHon, Heather M. Quinn, and Nicholas P. Carter. 2010. Vision for cross-layer optimization to address the dual challenges of energy and reliability. In Proceedings of the Design, Automation, and Test in Europe Conference and Exhibition (DATE’10). IEEE, Los Alamitos, CA, 1017--1022. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. M. M. Dickinson, J. B. Jackson, and G. C. Randa. 1964. Saturn V launch vehicle digital computer and data adapter. In Proceedings of the 1964 Fall Joint Computer Conference, Part I (AFIPS’64). ACM, New York, NY, 501--516. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Björn Döbel, Hermann Härtig, and Michael Engel. 2012. Operating system support for redundant multithreading. In Proceedings of the 10th ACM International Conference on Embedded Software. ACM, New York, NY, 83--92. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Pradeep Dubey. 2005. Recognition, mining and synthesis moves computers to the era of tera. Technology@Intel Magazine 9, 2, 1--10.Google ScholarGoogle Scholar
  32. Jason Duell. 2005. The Design and Implementation of Berkeley Lab’s Linux Checkpoint/Restart. Lawrence Berkeley National Laboratory.Google ScholarGoogle Scholar
  33. Nikil Dutt, Puneet Gupta, Alex Nicolau, Abbas BanaiyanMofrad, Mark Gottscho, and Majid Shoushtari. 2014. Multi-layer memory resiliency. In Proceedings of the 2014 51st ACM/EDAC/IEEE Design Automation Conference (DAC’14). IEEE, Los Alamitos, CA, 1--6. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Ifeanyi Egwutuoha, David Levy, Bran Selic, and Shiping Chen. 2013. A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems.Journal of Supercomputing 65, 3, 1302--1326. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Elmootazbellah Nabil Elnozahy. 1994. Manetho: Fault Tolerance in Distributed Systems Using Rollback-Recovery and Process Replication. Ph.D. Dissertation. Rice University, Houston, TX.Google ScholarGoogle Scholar
  36. Elmootazbellah Nabil Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson. 2002. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys 34, 3, 375--408. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. David Fiala, Frank Mueller, Christian Engelmann, Rolf Riesen, Kurt Ferreira, and Ron Brightwell. 2012. Detection and correction of silent data corruption for large-scale high-performance computing. In Proceedings of the International Conference on High Performance Computing, Networking, Storage, and Analysis. IEEE, Los Alamitos, CA, 78. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. S. Ghosh, R. Melhem, and D. Mosse. 1994. Fault-tolerant scheduling on a hard real-time multiprocessor system. In Proceedings of the 1994 8th International Parallel Processing Symposium. 775--782. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Sunondo Ghosh, Rami Melhem, Daniel Mossé, and Joydeep Sen Sarma. 1998. Fault-tolerant rate-monotonic scheduling. Real-Time Systems 15, 2, 149--181. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. D. Gizopoulos, M. Psarakis, S. V. Adve, P. Ramachandran, S. K. S. Hari, D. Sorin, A. Meixner, A. Biswas, and X. Vera. 2011. Architectures for online error detection and recovery in multicore processors. In Proceedings of the Design, Automation, and Test in Europe Conference and Exhibition (DATE’11). 1--6.Google ScholarGoogle Scholar
  41. Dennis Gnad, Muhammad Shafique, Florian Kriebel, Semeen Rehman, Duo Sun, and Jörg Henkel. 2015. Hayat: Harnessing dark silicon and variability for aging deceleration and balancing. In Proceedings of the 52nd Annual Design Automation Conference (DAC’15). ACM, New York, NY, Article No. 180. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. M. Gomaa, C. Scarbrough, T. N. Vijaykumar, and I. Pomeranz. 2003. Transient-fault recovery for chip multiprocessors. In Proceedings of the 30th Annual International Symposium on Computer Architecture. IEEE, Los Alamitos, CA, 98--109. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Meeta S. Gupta, Jude A. Rivers, Pradip Bose, Gu-Yeon Wei, and David Brooks. 2009. Tribeca: Design for PVT variations with local recovery and fine-grained adaptation. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture. ACM, New York, NY, 435--446. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. S. Gupta, S. Feng, A. Ansari, J. Blome, and S. Mahlke. 2008. The StageNet fabric for constructing resilient multicore systems. In Proceedings of the 2008 41st IEEE/ACM International Symposium on Microarchitecture (MICRO-41). 141--151. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Richard W. Hamming. 1950. Error detecting and error correcting codes. Bell System Technical Journal 29, 2, 147--160.Google ScholarGoogle ScholarCross RefCross Ref
  46. Haibo He, Sheng Chen, Kang Li, and Xin Xu. 2011. Incremental learning from stream data. IEEE Transactions on Neural Networks 22, 12, 1901--1914. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Wu He and Li Da Xu. 2014. Integration of distributed enterprise applications: A survey. IEEE Transactions on Industrial Informatics 10, 1, 35--42.Google ScholarGoogle ScholarCross RefCross Ref
  48. Rajamohana Hegde and Naresh R. Shanbhag. 2001. Soft digital signal processing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 9, 6, 813--823. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Erik Hendriks. 2002. VMADump. Retrieved July 23, 2017, from https://upc-bugs.lbl.gov/blcr/vmadump4/vmadump_arm.c.Google ScholarGoogle Scholar
  50. Jörg Henkel, Lars Bauer, Hongyan Zhang, Semeen Rehman, and Muhammad Shafique. 2014. Multi-layer dependability: From microarchitecture to application level. In Proceedings of the 51st Annual Design Automation Conference (DAC’14). ACM, New York, NY, Article No. 47. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. John L. Hennessy and David A. Patterson. 2011. Computer Architecture: A Quantitative Approach. Elsevier. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. Amr Hussien, Muhammed S. Khairy, Amin Khajeh, Kiarash Amiri, Ahmed M. Eltawil, and Fadi J. Kurdahi. 2010. A combined channel and hardware noise resilient Viterbi decoder. In Proceedings of the 2010 Conference Record of the 44th Asilomar Conference on Signals, Systems, and Computers (Asilomar’10). IEEE, Los Alamitos, CA, 395--399.Google ScholarGoogle Scholar
  53. Amr Hussien, Muhammad S. Khairy, Amin Khajeh, Ahmed M. Eltawil, and Fadi J. Kurdahi. 2011. A class of low power error compensation iterative decoders. In Proceedings of the 2011 IEEE Global Telecommunications Conference (GLOBECOM’11). IEEE, Los Alamitos, CA, 1--6.Google ScholarGoogle Scholar
  54. IEEE Standard. 1990. 610.12-1990 - IEEE Standard Glossary of Software Engineering Terminology. Retrieved July 23, 2017, from https://standards.ieee.org/findstds/standards/610.12-1990.htmlGoogle ScholarGoogle Scholar
  55. Intel. 2011. Intel® Xeon® Processor E7 Family: Reliability, Availability, and Serviceability. Technical Report. Data Center Group, Intel Corporation. http://www.intel.com/dam/www/public/us/en/documents/white-papers/xeon-37-family-ras-server-paper.pdf.Google ScholarGoogle Scholar
  56. Casey M. Jeffery and Renato J. O. Figueiredo. 2012. A flexible approach to improving system reliability with virtual lockstep. IEEE Transactions on Dependable and Secure Computing 9, 1, 2--15. Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. Doug Jewett. 1991. Integrity S2: A fault-tolerant unix platform. In Proceedings of the 1991 21st International Symposium on Fault-Tolerant Computing (FTCS-21). IEEE, Los Alamitos, CA, 512--519.Google ScholarGoogle ScholarCross RefCross Ref
  58. Eric Karl, David Blaauw, Dennis Sylvester, and Trevor Mudge. 2006. Reliability modeling and management in dynamic microprocessor-based systems. In Proceedings of the 43rd Annual Design Automation Conference (DAC’06). ACM, New York, NY, 1057--1060. Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. Amin Khajeh, Minyoung Kim, Nikil Dutt, Ahmed M. Eltawil, and Fadi J. Kurdahi. 2012. Error-aware algorithm/architecture coexploration for video over wireless applications. ACM Transactions on Embedded Computing Systems 11, 1, 15. Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. Florian Kriebel, Semeen Rehman, Duo Sun, Muhammad Shafique, and Jörg Henkel. 2014. ASER: Adaptive soft error resilience for reliability-heterogeneous processors in the dark silicon era. In Proceedings of the 51st Annual Design Automation Conference (DAC’14). ACM, New York, NY, Article No. 12. Google ScholarGoogle ScholarDigital LibraryDigital Library
  61. C. Mani Krishna and Kang G. Shin. 1986. On scheduling tasks with a quick recovery from failure. IEEE Transactions on Computers 100, 5, 448--455. Google ScholarGoogle ScholarDigital LibraryDigital Library
  62. K. J. Kuhn, M. D. Giles, D. Becher, P. Kolar, A. Kornfeld, R. Kotlyar, S. T. Ma, A. Maheshwari, and S. Mudanai. 2011. Process technology variation. IEEE Transactions on Electron Devices 58, 8, 2197--2208.Google ScholarGoogle ScholarCross RefCross Ref
  63. Chung-Chi Jim Li, Elliot M. Stewart, and W. Kent Fuchs. 1994. Compiler-assisted full checkpointing. Software: Practice and Experience 24, 10, 871--886. Google ScholarGoogle ScholarDigital LibraryDigital Library
  64. Tuo Li, Muhammad Shafique, Semeen Rehman, Jude Angelo Ambrose, Jörg Henkel, and Sri Parameswaran. 2013a. DHASER: Dynamic heterogeneous adaptation for soft-error resiliency in ASIP-based multi-core systems. In Proceedings of the International Conference on Computer-Aided Design (ICCAD’13). Los Alamitos, CA, 646--653. Google ScholarGoogle ScholarDigital LibraryDigital Library
  65. Tuo Li, Muhammad Shafique, Semeen Rehman, Swarnalatha Radhakrishnan, Roshan Ragel, Jude Angelo Ambrose, Jörg Henkel, and Sri Parameswaran. 2013b. CSER: HW/SW configurable soft-error resiliency for application specific instruction-set processors. In Proceedings of the 2013 Design, Automation, and Test in Europe Conference and Exhibition (DATE’13). 707--712. Google ScholarGoogle ScholarDigital LibraryDigital Library
  66. Frank Liberato, Rami Melhem, and Daniel Mossé. 2000. Tolerance to multiple transient faults for aperiodic tasks in hard real-time systems. IEEE Transactions on Computers 49, 9, 906--914. Google ScholarGoogle ScholarDigital LibraryDigital Library
  67. Jane W. S. Liu, Wei-Kuan Shih, Kwei-Jay Lin, Riccardo Bettati, and Jen-Yao Chung. 1994. Imprecise computations. Proceedings of the IEEE 82, 1, 83--94.Google ScholarGoogle ScholarCross RefCross Ref
  68. Klaus Lochmann and Andreas Goeb. 2011. A unifying model for software quality. In Proceedings of the 8th International Workshop on Software Quality (WoSQ’11). ACM, New York, NY, 3--10. Google ScholarGoogle ScholarDigital LibraryDigital Library
  69. Matthias May, Matthias Alles, and Norbert Wehn. 2008. A case study in reliability-aware design: A resilient LDPC code decoder. In Proceedings of the Conference on Design, Automation, and Test in Europe (DATE’08). ACM, New York, NY, 456--461. Google ScholarGoogle ScholarDigital LibraryDigital Library
  70. J. W. McPherson. 2006. Reliability challenges for 45Nm and beyond. In Proceedings of the 43rd Annual Design Automation Conference (DAC’06). ACM, New York, NY, 176--181. Google ScholarGoogle ScholarDigital LibraryDigital Library
  71. A. Meixner and D. J. Sorin. 2008. Detouring: Translating software to circumvent hard faults in simple cores. In Proceedings of the 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN’08). 80--89.Google ScholarGoogle Scholar
  72. Jim Mitchell, Daniel Henderson, George Ahrens, and Julissa Villareal. 2009. IBM Power Platform Reliability, Availability and Serviceability (RAS). Technical Report POW03003.doc. International Business Machines Corporation. https//www-304.ibm.com/webapp/set2/sas/f/lopdiags/info/Power6RASOverview.pdf.Google ScholarGoogle Scholar
  73. Sparsh Mittal and Jeffrey Vetter. 2015. A survey of techniques for modeling and improving reliability of computing systems. IEEE Transactions on Parallel and Distributed Systems 27, 4, 1226--1238. Google ScholarGoogle ScholarDigital LibraryDigital Library
  74. Sriram Narayanan, John Sartori, Rakesh Kumar, and Douglas L. Jones. 2010. Scalable stochastic processors. In Proceedings of the Conference on Design, Automation, and Test in Europe (DATE’10). 335--338. Google ScholarGoogle ScholarDigital LibraryDigital Library
  75. A. Orailoglu and R. Karri. 1994. Coactive scheduling and checkpoint determination during high level synthesis of self-recovering microarchitectures. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 2, 3, 304--311. Google ScholarGoogle ScholarDigital LibraryDigital Library
  76. Krishna Palem and Avinash Lingamneni. 2012. What to do about the end of Moore’s law, probably! In Proceedings of the 49th Annual Design Automation Conference (DAC’12). ACM, New York, NY, 924--929. Google ScholarGoogle ScholarDigital LibraryDigital Library
  77. Mihir Pandya and Miroslaw Malek. 1998. Minimum achievable utilization for fault-tolerant processing of periodic tasks. IEEE Transactions on Computers 47, 10, 1102--1112. Google ScholarGoogle ScholarDigital LibraryDigital Library
  78. Aashish Pant, Puneet Gupta, and Mihaela Van Der Schaar. 2012. AppAdapt: Opportunistic application adaptation in presence of hardware variation. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 20, 11, 1986--1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  79. Matthias Pflanz and Heinrich Theodor Vierhaus. 2001. Online check and recovery techniques for dependable embedded processors. IEEE Micro 5, 24--40. Google ScholarGoogle ScholarDigital LibraryDigital Library
  80. James S. Plank, Micah Beck, Gerry Kingsley, and Kai Li. 1994. Libckpt: Transparent Checkpointing Under Unix. Department of Computer Science, University of Tennessee, Knoxville.Google ScholarGoogle ScholarDigital LibraryDigital Library
  81. Stefan Poledna. 1996. Fault-Tolerant Real-Time Systems: The Problem of Replica Determinism. Kluwer Academic, Norwell, MA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  82. Stefan Poledna. 2007. System aspects of dependable systems. Lecture Notes on Dependable Computer Systems. https://ti.tuwien.ac.at/cps/teaching/courses/dependable-systems/slides/6a-DCS-system-aspects.pdf.Google ScholarGoogle Scholar
  83. Michael D. Powell, Arijit Biswas, Shantanu Gupta, and Shubhendu S. Mukherjee. 2009. Architectural core salvaging in a multi-core processor for hard-error tolerance. In Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA’09). 93--104. Google ScholarGoogle ScholarDigital LibraryDigital Library
  84. Milos Prvulovic, Zheng Zhang, and Josep Torrellas. 2002. ReVive: Cost-effective architectural support for rollback recovery in shared-memory multiprocessors. In Proceedings of the 29th Annual International Symposium on Computer Architecture. IEEE, Los Alamitos, CA, 111--122. Google ScholarGoogle ScholarDigital LibraryDigital Library
  85. Abbas Rahimi, Andrea Marongiu, Paolo Burgio, Rajesh K. Gupta, and Luca Benini. 2013. Variation-tolerant OpenMP tasking on tightly-coupled processor clusters. In Proceedings of the Design, Automation, and Test in Europe Conference and Exhibition (DATE’13). IEEE, Los Alamitos, CA, 541--546. Google ScholarGoogle ScholarDigital LibraryDigital Library
  86. Balkrishna Ramkumar and Volker Strumpen. 1997. Portable checkpointing for heterogeneous architectures. In Proceedings of the 27th Annual International Symposium on Fault-Tolerant Computing (FTCS-27). IEEE, Los Alamitos, CA, 58--67. Google ScholarGoogle ScholarDigital LibraryDigital Library
  87. B. Randell, P. Lee, and P. C. Treleaven. 1978. Reliability issues in computing system design. ACM Computing Surveys 10, 2, 123--165. Google ScholarGoogle ScholarDigital LibraryDigital Library
  88. Joydeep Ray, James C. Hoe, and Babak Falsafi. 2001. Dual use of superscalar datapath for transient-fault detection and recovery. In Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture. IEEE, Los Alamitos, CA, 214--224. Google ScholarGoogle ScholarDigital LibraryDigital Library
  89. Vijay Janapa Reddi, David Z. Pan, Sani R. Nassif, and Keith A. Bowman. 2012. Robust and resilient designs from the bottom-up: Technology, CAD, circuit, and system issues. In Proceedings of the 2012 17th Asia and South Pacific Design Automation Conference (ASP-DAC’12). IEEE, Los Alamitos, CA, 7--16.Google ScholarGoogle Scholar
  90. Semeen Rehman, Florian Kriebel, Duo Sun, Muhammad Shafique, and Jörg Henkel. 2014. dTune: Leveraging reliable code generation for adaptive dependability tuning under process variation and aging-induced effects. In Proceedings of the 51st Annual Design Automation Conference (DAC’14). ACM, New York, NY, Article No. 84. Google ScholarGoogle ScholarDigital LibraryDigital Library
  91. Semeen Rehman, Muhammad Shafique, Pau Vilimelis Aceituno, Florian Kriebel, Jian-Jia Chen, and Jörg Henkel. 2013. Leveraging variable function resilience for selective software reliability on unreliable hardware. In Proceedings of the Conference on Design, Automation, and Test in Europe (DATE’13). 1759--1764. Google ScholarGoogle ScholarDigital LibraryDigital Library
  92. Semeen Rehman, Muhammad Shafique, Florian Kriebel, and Jörg Henkel. 2011. Reliable software for unreliable hardware: Embedded code generation aiming at reliability. In Proceedings of the 7th IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis. ACM, New York, NY, 237--246. Google ScholarGoogle ScholarDigital LibraryDigital Library
  93. Semeen Rehman, Muhammad Shafique, Florian Kriebel, and Jörg Henkel. 2012. Raise: Reliability-aware instruction scheduling for unreliable hardware. In Proceedings of the 2012 17th Asia and South Pacific Design Automation Conference (ASP-DAC’12). IEEE, Los Alamitos, CA, 671--676.Google ScholarGoogle ScholarCross RefCross Ref
  94. George A. Reis, Jonathan Chang, and David I. August. 2007. Automatic instruction-level software-only recovery. IEEE Micro 1, 36--47. Google ScholarGoogle ScholarDigital LibraryDigital Library
  95. Jude A. Rivers, Meeta S. Gupta, Jeonghee Shin, Prabhakar N. Kudva, and Pradip Bose. 2011. Error tolerance in server class processors. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 30, 7, 945--959. Google ScholarGoogle ScholarDigital LibraryDigital Library
  96. Dimitrios Rodopoulos, Georgia Psychou, Mohamed M. Sabry, Francky Catthoor, Antonis Papanikolaou, Dimitrios Soudris, Tobias G. Noll, and David Atienza. 2015. Classification framework for analysis and modeling of physically induced reliability violations. ACM Computing Surveys 47, 3, Article No. 38. Google ScholarGoogle ScholarDigital LibraryDigital Library
  97. Bogdan F. Romanescu and Daniel J. Sorin. 2008. Core cannibalization architecture: Improving lifetime chip performance for multicore processors in the presence of hard faults. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques. ACM, New York, NY, 43--51. Google ScholarGoogle ScholarDigital LibraryDigital Library
  98. Tajana Simunic Rosing, Kresimir Mihic, and Giovanni De Micheli. 2007. Power and reliability management of SoCs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 15, 4, 391--403. Google ScholarGoogle ScholarDigital LibraryDigital Library
  99. E. Rotenberg. 1999. AR-SMT: A microarchitectural approach to fault tolerance in microprocessors. In Proceedings of the 29th Annual International Symposium on Fault-Tolerant Computing. 84--91. Google ScholarGoogle ScholarDigital LibraryDigital Library
  100. Goutam Kumar Saha. 2006. Software based fault tolerance: A survey. Ubiquity 2006, Article No. 1. Google ScholarGoogle ScholarDigital LibraryDigital Library
  101. Adrian Sampson, James Bornholt, and Luis Ceze. 2015. Hardware-software co-design: Not just a cliché. In LIPIcs-Leibniz International Proceedings in Informatics, Vol. 32. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.Google ScholarGoogle Scholar
  102. Jose Carlos Sancho, Fabrizio Petrini, Kei Davis, Roberto Gioiosa, and Song Jiang. 2005. Current practice and a direction forward in checkpoint/restart implementations for fault tolerance. In Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium. IEEE, Los Alamitos, CA, 8. Google ScholarGoogle ScholarDigital LibraryDigital Library
  103. Muhammad Shafique, Siddharth Garg, Jörg Henkel, and Diana Marculescu. 2014. The EDA challenges in the dark silicon era. In Proceedings of the 2014 51st ACM/EDAC/IEEE Design Automation Conference (DAC’14). IEEE, Los Alamitos, CA, 1--6. Google ScholarGoogle ScholarDigital LibraryDigital Library
  104. Jeonghee Shin, Victor Zyuban, Pradip Bose, and Timothy M. Pinkston. 2008. A proactive wearout recovery approach for exploiting microarchitectural redundancy to extend cache SRAM lifetime. ACM SIGARCH Computer Architecture News 36, 353--362. Google ScholarGoogle ScholarDigital LibraryDigital Library
  105. Philip P. Shirvani, Nirmal R. Saxena, and Edward J. McCluskey. 2000. Software-implemented EDAC protection against SEUs. IEEE Transactions on Reliability 49, 3, 273--284.Google ScholarGoogle ScholarCross RefCross Ref
  106. D. P. Siewiorek. 1990. Fault tolerance in commercial computers. Computer 23, 7, 26--37. Google ScholarGoogle ScholarDigital LibraryDigital Library
  107. D. Siewiorek and R. Swarz. 1982. The Theory and Practice of Reliable System Design. Digital Press.Google ScholarGoogle Scholar
  108. Joseph Slember and Priya Narasimhan. 2006. Living with nondeterminism in replicated middleware applications. In Proceedings of the ACM/IFIP/USENIX 2006 International Conference on Middleware. 81--100. Google ScholarGoogle ScholarDigital LibraryDigital Library
  109. J. Hamilton Slye and Elmootazbellah Nabil Elnozahy. 1996. Supporting nondeterministic execution in fault-tolerant systems. In Proceedings of the IEEE Annual Symposium on Fault-Tolerant Computing. 250--259. Google ScholarGoogle ScholarDigital LibraryDigital Library
  110. D. J. Sorin, M. M. K. Martin, M. D. Hill, and D. A. Wood. 2002. SafetyNet: Improving the availability of shared memory multiprocessors with global checkpoint/recovery. In Proceedings of the 2002 29th Annual International Symposium on Computer Architecture. 123--134. Google ScholarGoogle ScholarDigital LibraryDigital Library
  111. Brinkley Sprunt, Lui Sha, and John Lehoczky. 1989. Scheduling Sporadic and Aperiodic Events in a Hard Real-Time System. Technical Report. DTIC Document.Google ScholarGoogle Scholar
  112. Jayanth Srinivasan, Sarita V. Adve, Pradip Bose, and Jude A. Rivers. 2005. Exploiting Structural duplication for lifetime reliability enhancement. In Proceedings of the 32nd Annual International Symposium on Computer Architecture (ISCA’05). IEEE, Los Alamitos, CA, 520--531. Google ScholarGoogle ScholarDigital LibraryDigital Library
  113. Hongbin Sun, Pengju Ren, Nanning Zheng, Tong Zhang, and Tao Li. 2011. Architecting high-performance energy-efficient soft error resilient cache under 3D integration technology. Microprocessors and Microsystems 35, 4, 371--381. Google ScholarGoogle ScholarDigital LibraryDigital Library
  114. Karthik Sundaramoorthy, Zach Purser, and Eric Rotenberg. 2000. Slipstream processors: Improving both performance and fault tolerance. ACM SIGARCH Computer Architecture News 28, 257--268.
  115. Sumant Tambe. 2010. Model-Driven Fault-Tolerance Provisioning for Component-Based Distributed Real-Time Embedded Systems. Ph.D. Dissertation. Vanderbilt University, Nashville, TN.
  116. Olivier Temam. 2012. A defect-tolerant accelerator for emerging high-performance applications. ACM SIGARCH Computer Architecture News 40, 3, 356--367.
  117. James E. Tomayko. 1986. Lessons Learned in Creating Spacecraft Computer Systems: Implications for Using Ada for the Space Station. Technical Report. Software Engineering Institute, Carnegie-Mellon University, Pittsburgh, PA.
  118. Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, and Rebecca L. Stamm. 1996. Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor. ACM SIGARCH Computer Architecture News 24, 191--202.
  119. Shyamsundar Venkataraman, Rui Santos, Akash Kumar, and Jasper Kuijsten. 2015. Hardware task migration module for improved fault tolerance and predictability. In Proceedings of the 2015 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS’15). IEEE, Los Alamitos, CA, 197--202.
  120. John Von Neumann. 1956. Probabilistic logics and the synthesis of reliable organisms from unreliable components. Automata Studies 34, 43--98.
  121. Nicholas J. Wang and Sanjay J. Patel. 2006. ReStore: Symptom-based soft error detection in microprocessors. IEEE Transactions on Dependable and Secure Computing 3, 3, 188--201.
  122. Kun-Lung Wu and W. Kent Fuchs. 1990. Recoverable distributed shared virtual memory. IEEE Transactions on Computers 39, 4, 460--469.
  123. Kun-Lung Wu, W. Kent Fuchs, and Janak H. Patel. 1990. Error recovery in shared memory multiprocessors using private caches. IEEE Transactions on Parallel and Distributed Systems 1, 2, 231--240.
  124. Jun Yan and Wei Zhang. 2005. Compiler-guided register reliability improvement against soft errors. In Proceedings of the 5th ACM International Conference on Embedded Software. ACM, New York, NY, 203--209.

    • Published in

      ACM Computing Surveys, Volume 50, Issue 4
      July 2018
      531 pages
      ISSN: 0360-0300
      EISSN: 1557-7341
      DOI: 10.1145/3135069
      Editor: Sartaj Sahni

      Copyright © 2017 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 4 October 2017
      • Accepted: 1 April 2017
      • Revised: 1 December 2016
      • Received: 1 May 2016

      Qualifiers

      • survey
      • Research
      • Refereed
