The Case for Lifetime Reliability-Aware Microprocessors

Published: 02 March 2004

Abstract

Ensuring long processor lifetimes by limiting failures due to wear-out related hard errors is a critical requirement for all microprocessor manufacturers. We observe that continuous device scaling and increasing temperatures are making lifetime reliability targets even harder to meet. However, current methodologies for qualifying lifetime reliability are overly conservative since they assume worst-case operating conditions. This paper makes the case that the continued use of such methodologies will significantly and unnecessarily constrain performance. Instead, lifetime reliability awareness at the microarchitectural design stage can mitigate this problem, by designing processors that dynamically adapt in response to the observed usage to meet a reliability target.

We make two specific contributions. First, we describe an architecture-level model and its implementation, called RAMP, that can dynamically track lifetime reliability, responding to changes in application behavior. RAMP is based on state-of-the-art device models for different wear-out mechanisms. Second, we propose dynamic reliability management (DRM) - a technique where the processor can respond to changing application behavior to maintain its lifetime reliability target. In contrast to current worst-case behavior based reliability qualification methodologies, DRM allows processors to be qualified for reliability at lower (but more likely) operating points than the worst case. Using RAMP, we show that this can save cost and/or improve performance, that dynamic voltage scaling is an effective response technique for DRM, and that dynamic thermal management neither subsumes nor is subsumed by DRM.
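
The DRM idea in the abstract (track how fast lifetime is being consumed, then adjust the operating point to stay on a reliability target) can be illustrated with a toy control loop. This is only a sketch under stated assumptions, not the RAMP model or the paper's DRM algorithm: it uses a single generic Arrhenius-style temperature acceleration factor in place of RAMP's device models, and the names `relative_failure_rate` and `DrmController`, the qualification temperature, the activation energy, and the discrete operating levels are all hypothetical.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant in eV/K


def relative_failure_rate(temp_k, qual_temp_k=355.0, activation_ev=0.9):
    """Failure-rate multiplier relative to the qualification temperature.

    ASSUMPTION: a generic Arrhenius form with placeholder values for
    qual_temp_k and activation_ev; it stands in for real wear-out models.
    """
    exponent = (activation_ev / BOLTZMANN_EV_PER_K) * (
        1.0 / qual_temp_k - 1.0 / temp_k
    )
    return math.exp(exponent)


class DrmController:
    """Toy controller that steps a voltage/frequency level up or down.

    Level 0 is the slowest (coolest) setting; the highest index is the
    fastest. The levels themselves are hypothetical.
    """

    def __init__(self, num_levels, qual_temp_k=355.0):
        self.num_levels = num_levels
        self.qual_temp_k = qual_temp_k
        self.level = num_levels - 1  # start at the fastest setting
        self.wear = 0.0      # accumulated (relative failure rate * time)
        self.elapsed = 0.0   # total observed time

    def update(self, temp_k, dt):
        """Record one interval at temp_k lasting dt, then pick a level."""
        self.wear += relative_failure_rate(temp_k, self.qual_temp_k) * dt
        self.elapsed += dt
        average_rate = self.wear / self.elapsed
        if average_rate > 1.0 and self.level > 0:
            # Consuming lifetime faster than the target allows: scale down.
            self.level -= 1
        elif average_rate < 1.0 and self.level < self.num_levels - 1:
            # Running below the target rate: headroom to scale up.
            self.level += 1
        return self.level


if __name__ == "__main__":
    ctrl = DrmController(num_levels=3)
    for temp_k, dt in [(375.0, 1.0), (375.0, 1.0), (330.0, 20.0)]:
        print(temp_k, dt, ctrl.update(temp_k, dt))
```

The controller compares the running average of the relative failure rate against 1.0, the rate at the assumed qualification point, so a hot phase can be offset by a later cool phase instead of being ruled out in advance by a worst-case assumption.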

  • Published in

    ACM SIGARCH Computer Architecture News, Volume 32, Issue 2
    ISCA 2004
    March 2004
    373 pages
    ISSN: 0163-5964
    DOI: 10.1145/1028176
    • ACM Conferences
      ISCA '04: Proceedings of the 31st annual international symposium on Computer architecture
      June 2004
      373 pages
      ISBN: 0769521436

    Copyright © 2004 Authors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Published: 2 March 2004

    Qualifiers

    • article
