ABSTRACT
Choosing the correct error injection technique is of primary importance in the simulation-based design and evaluation of robust systems that are resilient to soft errors. Many low-level (e.g., flip-flop-level) error injection techniques are generally used only for small systems because of their long execution times and significant memory requirements. High-level error injections at the architecture or memory levels are generally fast but can be inaccurate. Unfortunately, very little research literature quantitatively analyzes the inaccuracies of high-level error injection techniques. In this paper, we use simulation and emulation results to understand the accuracy trade-offs associated with a variety of high-level error injection techniques. A detailed analysis of error propagation explains the causes of the high degree of inaccuracy associated with error injection techniques at higher levels of abstraction.
Index Terms
- Quantitative evaluation of soft error injection techniques for robust system design