Abstract
Ensuring long processor lifetimes by limiting failures due to wear-out related hard errors is a critical requirement for all microprocessor manufacturers. We observe that continuous device scaling and increasing temperatures are making lifetime reliability targets even harder to meet. However, current methodologies for qualifying lifetime reliability are overly conservative since they assume worst-case operating conditions. This paper makes the case that the continued use of such methodologies will significantly and unnecessarily constrain performance. Instead, lifetime reliability awareness at the microarchitectural design stage can mitigate this problem, by designing processors that dynamically adapt in response to the observed usage to meet a reliability target.
We make two specific contributions. First, we describe an architecture-level model and its implementation, called RAMP, that can dynamically track lifetime reliability, responding to changes in application behavior. RAMP is based on state-of-the-art device models for different wear-out mechanisms. Second, we propose dynamic reliability management (DRM), a technique where the processor can respond to changing application behavior to maintain its lifetime reliability target. In contrast to current worst-case behavior based reliability qualification methodologies, DRM allows processors to be qualified for reliability at lower (but more likely) operating points than the worst case. Using RAMP, we show that this can save cost and/or improve performance, that dynamic voltage scaling is an effective response technique for DRM, and that dynamic thermal management neither subsumes nor is subsumed by DRM.
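The idea of tracking reliability at run time and adapting to stay within a target can be illustrated with a small sketch. This is not RAMP or the paper's DRM algorithm: the Arrhenius-style temperature acceleration, the activation energy, the reference temperature (standing in for the qualification point), the operating levels, and all names such as `ReliabilityTracker` and `choose_level` are assumptions made purely for illustration.

```python
import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K


def relative_failure_rate(temp_k, ref_temp_k=355.0, activation_ev=0.9):
    """Failure rate relative to the rate at ref_temp_k.

    Assumption: a generic Arrhenius-style temperature dependence with a
    hypothetical activation energy; this is not one of RAMP's device models.
    """
    exponent = (activation_ev / BOLTZMANN_EV) * (1.0 / ref_temp_k - 1.0 / temp_k)
    return math.exp(exponent)


class ReliabilityTracker:
    """Accumulates a time-weighted average of the relative failure rate."""

    def __init__(self):
        self.elapsed = 0.0
        self.accumulated = 0.0

    def record(self, temp_k, duration):
        """Account for `duration` time units spent at temperature `temp_k`."""
        self.accumulated += relative_failure_rate(temp_k) * duration
        self.elapsed += duration

    def projected_average(self, temp_k, duration):
        """Average rate if the next `duration` units run at `temp_k`."""
        total = self.accumulated + relative_failure_rate(temp_k) * duration
        return total / (self.elapsed + duration)


def choose_level(tracker, levels, interval, target_rate=1.0):
    """Pick the fastest level that keeps the projected average within target.

    `levels` is a list of (name, expected_temp_k) pairs, fastest first
    (hypothetical voltage/frequency settings). If no level meets the target,
    fall back to the slowest one.
    """
    for name, temp_k in levels:
        if tracker.projected_average(temp_k, interval) <= target_rate:
            return name
    return levels[-1][0]


if __name__ == "__main__":
    levels = [("high", 375.0), ("mid", 355.0), ("low", 335.0)]
    tracker = ReliabilityTracker()
    tracker.record(temp_k=335.0, duration=100.0)  # a cool-running phase
    print(choose_level(tracker, levels, interval=10.0))
```

In this toy model a cool history builds up slack against the target, so a faster and hotter level can be selected for the next interval, while a hot history forces the slowest level. That mirrors, in a very simplified way, the abstract's point that adapting to observed usage avoids having to design for the worst case.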
The Case for Lifetime Reliability-Aware Microprocessors
ISCA '04: Proceedings of the 31st annual international symposium on Computer architecture