Abstract
Nanoscale technology nodes bring reliability concerns back to the center stage of digital system design. A systematic classification of approaches that increase system resilience in the presence of functional hardware (HW)-induced errors is presented, dealing with higher system abstractions, such as the (micro)architecture, the mapping, and platform software (SW). The field is surveyed in a systematic way based on nonoverlapping categories, which add insight into the ongoing work by exposing similarities and differences. HW and SW solutions are discussed in a similar fashion so that interrelationships become apparent. Representative examples from the literature illustrate the properties of each category. Moreover, it is demonstrated how hybrid schemes can be decomposed into their primitive components.
Index Terms
- Classification of Resilience Techniques Against Functional Errors at Higher Abstraction Layers of Digital Systems