ABSTRACT
Transient faults due to particle strikes are a key challenge in microprocessor design. Driven by exponentially increasing transistor counts, per-chip faults are a growing burden. To protect against soft errors, redundancy techniques such as redundant multithreading (RMT) are often used. However, these techniques assume that the probability that a structural fault will result in a soft error (i.e., the Architectural Vulnerability Factor (AVF)) is 100 percent, unnecessarily draining processor resources. Due to the high cost of redundancy, there have been efforts to throttle RMT at runtime. To date, these methods have not incorporated an AVF model and therefore tend to be ad hoc. Unfortunately, computing the AVF of complex microprocessor structures (e.g., the ISQ) can be quite involved.
To provide probabilistic guarantees about fault tolerance, we have created a rigorous characterization of AVF behavior that can be easily implemented in hardware. We experimentally demonstrate AVF variability within and across the SPEC2000 benchmarks and identify strong correlations between structural AVF values and a small set of processor metrics. Using these simple indicators as predictors, we create a proof-of-concept RMT implementation that demonstrates that AVF prediction can be used to maintain a low fault tolerance level without significant performance impact.
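The idea of using a simple processor metric as an AVF predictor can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the choice of queue occupancy as the metric, the sample values, the single-variable linear model, and the 0.3 throttling threshold. The sketch measures the correlation between a metric and observed AVF, fits a least-squares line, and enables redundancy only when the predicted AVF exceeds the threshold.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys ~ slope * xs + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return slope, intercept


def should_enable_rmt(metric, slope, intercept, threshold):
    """Predict AVF from one metric and decide whether to run redundantly."""
    # Clamp the prediction to [0, 1] because AVF is a probability.
    predicted_avf = min(1.0, max(0.0, slope * metric + intercept))
    return predicted_avf > threshold


# Hypothetical per-interval samples: queue occupancy (fraction of entries in
# use) and the AVF measured offline for the same intervals.
occupancy = [0.1, 0.3, 0.5, 0.7, 0.9]
measured_avf = [0.05, 0.15, 0.25, 0.35, 0.45]

correlation = pearson(occupancy, measured_avf)
slope, intercept = fit_line(occupancy, measured_avf)

# Hypothetical policy: turn on redundancy when predicted AVF exceeds 0.3.
THRESHOLD = 0.3
high_occupancy_decision = should_enable_rmt(0.8, slope, intercept, THRESHOLD)
low_occupancy_decision = should_enable_rmt(0.2, slope, intercept, THRESHOLD)
```

A hardware realization would presumably replace the offline fit with fixed coefficients and a comparator; the sketch only shows the shape of the predict-then-throttle decision.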