ABSTRACT
Designing dependable on-chip many-core systems requires accounting for multiple reliability threats, e.g., soft errors, aging, and process variation. In this paper, we introduce a novel adaptive Dependability Tuning (dTune) scheme for many-core processors. It leverages knowledge of the varying vulnerability and error-masking properties of different applications, along with multiple compiled versions (each offering distinct reliability and performance properties). Our dTune system dynamically tunes the dependability mode at the hardware level through hybrid Redundant Multithreading tuning and at the software level through selection of a reliable code version under given performance constraints. It jointly accounts for soft errors and for variations in core performance caused by design-time process variation and/or run-time aging-induced degradation. We compare our dTune system with four state-of-the-art techniques and achieve, on average, 44% and up to 63% improved task reliability across different chip configurations, variability maps, and aging years.
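The software-level step mentioned in the abstract, selecting a reliable code version under a performance constraint, can be illustrated with a minimal sketch. This is not the dTune algorithm itself: the `CodeVersion` fields, the `select_version` helper, and all numbers are hypothetical, and the sketch assumes that each compiled version is characterized by a cycle count and a failure probability, and that a core's effective frequency (reduced by process variation or aging) determines whether a version meets the deadline.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CodeVersion:
    """One compiled version of a task (all fields are illustrative assumptions)."""
    name: str
    cycles: int          # assumed execution length in clock cycles
    failure_prob: float  # assumed probability that the task fails due to soft errors


def select_version(
    versions: Sequence[CodeVersion],
    core_freq_hz: float,
    deadline_s: float,
) -> Optional[CodeVersion]:
    """Return the most reliable version that still meets the deadline on this core.

    Returns None when no version fits the deadline at the given frequency.
    """
    feasible = [v for v in versions if v.cycles / core_freq_hz <= deadline_s]
    return min(feasible, key=lambda v: v.failure_prob, default=None)


# Hypothetical versions: more hardening costs more cycles but fails less often.
versions = [
    CodeVersion("fast", 1_000_000, 0.020),
    CodeVersion("balanced", 1_400_000, 0.008),
    CodeVersion("hardened", 2_000_000, 0.002),
]

if __name__ == "__main__":
    deadline = 1.1e-3  # seconds, assumed performance constraint
    for freq in (2.0e9, 1.5e9, 0.8e9):  # fresh core, slower/aged core, very slow core
        choice = select_version(versions, freq, deadline)
        print(freq, choice.name if choice else "no feasible version")
```

In this sketch a slower core forces a less hardened version, which is the kind of trade-off where a hardware-level mechanism such as redundant multithreading could compensate; how dTune actually combines the two levels is described in the paper, not here.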
- dTune: Leveraging Reliable Code Generation for Adaptive Dependability Tuning under Process Variation and Aging-Induced Effects