
Methods for fault tolerance in networks-on-chip

Published: 11 July 2013

Abstract

Networks-on-Chip constitute the interconnection architecture of future, massively parallel multiprocessors that assemble hundreds to thousands of processing cores on a single chip. Their integration is enabled by the ongoing miniaturization of chip manufacturing technologies following Moore's Law. This miniaturization, however, also makes circuit elements more susceptible to failure. Research on fault-tolerant Networks-on-Chip tries to mitigate partial failure and its effect on network performance and reliability by exploiting various forms of redundancy at suitable network layers. This article reviews the failure mechanisms, fault models, diagnosis techniques, and fault-tolerance methods in on-chip networks, and surveys and summarizes the research of the last ten years. It is structured along three communication layers: the data link, the network, and the transport layers. The most important results are summarized, and open research problems and challenges are highlighted to guide future research on this topic.

References

  1. Agarwal, M., Paul, B., Zhang, M., and Mitra, S. 2007. Circuit failure prediction and its application to transistor aging. In Proceedings of the 25th IEEE VLSI Test Symposium. 277--286. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Aisopos, K., Chen, C.-H., and Peh, L.-S. 2011a. Enabling system-level modeling of variation-induced faults in networks-on-chips. In Proceedings of the 48th ACM/EDAC/IEEE Design Automation Conference (DAC'11). 930--935. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Aisopos, K., Deorio, A., Peh, L.-S., and Bertacco, V. 2011b. Ariadne: Agnostic reconfiguration in a disconnected network environment. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT'11). 298--309. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Alaghi, A., Karimi, N., Sedghi, M., and Navabi, Z. 2007. Online noc switch fault detection and diagnosis using a high level fault model. In Proceedings of the 22nd IEEE International Symposium on Defect and Fault-Tolerance in VLSI Systems (DFT'07). 21--29. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Alaghi, A., Sedghi, M., Karimi, N., Fathy, M., and Navabi, Z. 2008. Reliable noc architecture utilizing a robust rerouting algorithms. In 9th IEEE East-West Design and Test Symposium (EWDTS'08).Google ScholarGoogle Scholar
  6. Ali, M., Welzl, M., and Hellebrand, S. 2005. A dynamic routing mechanism for network on chip. In Proceedings of the 23rd NORCHIP Conference. 70--73.Google ScholarGoogle Scholar
  7. Ali, M., Welzl, M., and Hessler, S. 2007. And end 2 end reliability protocol to address transient faults in network on chips. In Digest of the Workshop on Diagnostic Services in Network-on-Chips.Google ScholarGoogle Scholar
  8. Anghel, L. and Nicolaidis, M. 2000. Cost reduction and evaluation of a temporary faults detecting technique. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition. 591--598. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Angiolini, F., Meloni, P., Carta, S., Benini, L., and Raffo, L. 2006. Contrasting a noc and a traditional interconnect fabric with layout awareness. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE'06). Vol. 1. 1--6. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Avizienis, A., Laprie, J.-C., Randell, B., and Landwehr, C. 2004. Basic concepts and taxonomy of dependable and secure computing. IEEE Trans. Dependable Secure Comput. 1, 1, 11--33. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Baumann, R. 2005. Soft errors in advanced computer systems. IEEE Des. Test Comput. 22, 3, 258--266. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Bell, S., Edwards, B., Amann, J., Conlin, R., Joyce, K., Leung, V., Mackay, J., Reif, M., Bao, L., Brown, J., Mattina, M., Miao, C.-C., Ramey, C., Wentzlaff, D., Anderson, W., Berger, E., Fairbanks, N., Khan, D., Montenegro, F., Stickney, J., and Zook, J. 2008. TILE64 processor: A 64-core SoC with mesh interconnect. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC'08). 87--90.Google ScholarGoogle Scholar
  13. Bertozzi, D., Benini, L., and De Micheli, G. 2002. Low power error resilient encoding for on-chip data buses. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition. 102--109. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Bertozzi, D., Benini, L., and De Micheli, G. 2005. Error control schemes for on-chip communication links: The energy reliability tradeoff. IEEE Trans. Comput.-Aid. Des. Integr. Circ. Syst. 24, 6, 818--831. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Bjerregaard, T. and Mahadevan, S. 2006. A survey of research and practices of network-on-chip. ACM Comput. Surv. 38, 1--51. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Bobda, C., Ahmadinia, A., Majer, M., Teich, J., Fekete, S., and Van Der Veen, J. 2005. Dynoc: A dynamic infrastructure for communication in dynamically reconfigurable devices. In Proceedings of the International Field Programmable Logic and Applications Conference. 153--158.Google ScholarGoogle ScholarCross RefCross Ref
  17. Bogdan, P., Dumitras, T., and Marculescu, R. 2007. Stochastic communication: A new paradigm for fault-tolerant networks-on-chip. VLSI Des. 2007, 1--17.Google ScholarGoogle ScholarCross RefCross Ref
  18. Bolotin, E., Cidon, I., Ginosar, R., and Kolodny, A. 2007. Routing table minimization for irregular mesh nocs. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE'07). 1--6. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Bondavalli, A., Chiaradonna, S., Giandomenico, F. D., and Grandoni, F. 2000. Threshold-based mechanisms to discriminate transient from intermittent faults. IEEE Trans. Comput. 49, 4, 230--245. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Boppana, R. V. and Chalasani, S. 1995. Fault-tolerant wormhole routing algorithms for mesh networks. IEEE Trans. Comput. 44, 7, 848--864. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Borkar, S. 2005. Designing reliable systems from unreliable components: The challenges of transistor variability and degradation. IEEE Micro 25, 6, 10--16. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Borkar, S. 2007. Thousand core chips: A technology perspective. In Proceedings of the 44th Annual Design Automation Conference (DAC'07). ACM Press, New York, 746--749. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Boyan, J. and Littman, M. 1994. Packet routing in dynamically changing networks: A reinforcement learning approach. Adv. Neural Inf. Process. Syst. 6, 671--678.Google ScholarGoogle Scholar
  24. Breuer, M., Gupta, S., and Mak, T. 2004. Defect and error tolerance in the presence of massive numbers of defects. IEEE Des. Test Comput. 21, 3, 216--227. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Chen, C.-L. and Chiu, G.-M. 2001. A fault-tolerant routing scheme for meshes with nonconvex faults. IEEE Trans. Parallel Distrib. Syst. 12, 5, 467--475. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Concatto, C., Matos, D., Carro, L., Kastensmidt, F., Susin, A., Cota, E., and Kreutz, M. 2009. Fault tolerant mechanism to improve yield in nocs using a reconfigurable router. In Proceedings of the 22nd Annual Symposium on Integrated Circuits and System Design (SBCCI'09). ACM Press, New York, 1--6. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Constantinescu, C. 2003. Trends and challenges in vlsi circuit reliability. IEEE Micro 23, 4, 14--19. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Constantinides, K., Plaza, S., Blome, J., Zhang, B., Bertacco, V., Mahlke, S., Austin, T., and Orshansky, M. 2006. Bulletproof: A defect-tolerant cmp switch architecture. In Proceedings of the 12th International High-Performance Computer Architecture Symposium. 5--16.Google ScholarGoogle Scholar
  29. Cota, E., Kastensmidt, F., Cassel, M., Herve, M., Almeida, P., Meirelles, P., Amory, A., and Lubaszewski, M. 2008. A high-fault-coverage approach for the test of data, control and handshake interconnects in mesh networks-on-chip. IEEE Trans. Comput. 57, 9, 1202--1215. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Cuviello, M., Dey, S., Bai, X., and Zhao, Y. 1999. Fault modeling and simulation for crosstalk in system-on-chip interconnects. In IEEE/ACM International Digest of Technical Papers on Computer-Aided Design. 297--303. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Dalirsani, A., Holst, S., Elm, M., and Wunderlich, H. 2011. Structural test for graceful degradation of noc switches. In Proceedings of the European Test Symposium (ETS'11). 183--188. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. De Micheli, G. and Benini, L. 2006. Networks On Chips: Technology and Tools. Morgan Kaufmann Publishers.Google ScholarGoogle Scholar
  33. Dodd, P. and Massengill, L. 2003. Basic mechanisms and modeling of single-event upset in digital microelectronics. IEEE Trans. Nuclear Sci. 50, 3, 583--602.Google ScholarGoogle ScholarCross RefCross Ref
  34. Duan, X., Zhang, D., and Sun, X. 2009. Fault-tolerant routing schemes for wormhole mesh. In Proceedings of the IEEE International Parallel and Distributed Processing with Applications Symposium. 298--301.Google ScholarGoogle Scholar
  35. Duato, J., Lysne, O., Pang, R., and Pinkston, T. 2005. Part i: A theory for deadlock-free dynamic network reconfiguration. IEEE Trans. Parallel Distrib. Syst. 16, 5, 412--427. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Dubrova, E. 2008. Fault-Tolerant Design: An Introduction. Kluwer Academic Publishers. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Dumitras, T. and Marculescu, R. 2003. On-chip stochastic communication {soc applications}. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE'03). 790--795. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Dutta, A. and Touba, N. 2007. Reliable network-on-chip using a low cost unequal error protection code. In Proceedings of the 22nd IEEE International Defect and Fault-Tolerance in VLSI Systems Symposium (DFT'07). 3--11. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Eghbal, A., Yaghini, P. M., Pedram, H., and Zarandi, H. R. 2010. Designing fault-tolerant network-on-chip router architecture. Int. J. Electron. 97, 10, 1181--1192.Google ScholarGoogle ScholarCross RefCross Ref
  40. Ejlali, A., Al-Hashimi, B. M., Rosinger, P., and Miremadi, S. G. 2007. Joint consideration of fault-tolerance, energy efficiency and performance in on-chip networks. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE'07). 647--1652. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Elakkumanan, P., Prasad, K., and Sridhar, R. 2006. Time redundancy based scan flip-flop reuse to reduce ser of combinational logic. In Proceedings of the 7th International Symposium on Quality Electronic Design (ISQED'06). IEEE Computer Society, Los Alamitos, CA, 617--624. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Ernst, D., Kim, N. S., Das, S., Pant, S., Rao, R., Pham, T., Ziesler, C., Blaauw, D., Austin, T., Flautner, K., and Mudge, T. 2003. Razor: A low-power pipeline based on circuit-level timing speculation. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'03). 7--18. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Feng, C., Lu, Z., Jantsch, A., Li, J., and Zhang, M. 2010a. FoN: Fault-on-neighbor aware routing algorithm for networks-onchip. In International SOC Conference.Google ScholarGoogle Scholar
  44. Feng, C., Lu, Z., Jantsch, A., Li, J., and Zhang, M. 2010b. A reconfigurable fault-tolerant deflection routing algorithm based on reinforcement learning for networks-on-chip. In Proceedings of the International Workshop on Network on Chip Architectures (NoCArc'10). Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Fick, D., Deorio, A., Chen, G., Bertacco, V., Sylvester, D., and Blaauw, D. 2009a. A highly resilient routing algorithm for fault-tolerant nocs. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE'09). 21--26. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Fick, D., Deorio, A., Hu, J., Bertacco, V., Blaauw, D., and Sylvester, D. 2009b. Vicis: A reliable network for unreliable silicon. In Proceedings of the 46th Annual Design Automation Conference (DAC'09). ACM Press, New York, 812--817. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Fiorin, L., Micconi, L., and Sami, M. 2011. Design of fault tolerant network interfaces for nocs. In Proceedings of the 14th Euromicro Conference on Digital System Design. 393--400. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. Flich, J., Mejia, A., Lopez, P., and Duato, J. 2007. Region-based routing: An efficient routing mechanism to tackle unreliable hardware in network on chips. In Proceedings of the Symposium on Networks-on-Chip (NOCS'07). 183--194. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Flich, J., Skeie, T., Mejia, A., Lysne, O., Lopez, P., Robles, A., Duato, J., Koibuchi, M., Rokicki, T., and Sancho, J. 2012. A survey and evaluation of topology-agnostic deterministic routing algorithms. IEEE Trans. Parallel Distrib. Syst. 23, 3, 405--425. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Forney, G. D. 1973. The viterbi algorithm. Proc. IEEE 61, 3, 268--278.Google ScholarGoogle ScholarCross RefCross Ref
  51. Frantz, A., Kastensmidt, F., Carro, L., and Cota, E. 2006a. Dependable network-on-chip router able to simultaneously tolerate soft errors and crosstalk. In Proceedings of the IEEE International Test Conference (ITC'06). 1--9.Google ScholarGoogle Scholar
  52. Frantz, A. P., Cassel, M., Kastensmidt, F. L., Cota, E., and Carro, L. 2007. Crosstalk- and seu-aware networks on chips. IEEE Des. Test Comput. 24, 4, 340--350. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. Frantz, A. P., Kastensmidt, F. L., Carro, L., and Cota, E. 2006b. Evaluation of seu and crosstalk effects in network-on-chip switches. In Proceedings of the Symposium on Integrated Circuits and Systems Design (SBCCI'06). Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. Fu, B. and Ampadu, P. 2009. On hamming product codes with type-ii hybrid arq for on-chip interconnects. IEEE Trans. Circ. Syst. I: Regular Papers 56, 9, 2042--2054. Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. Fukushima, Y., Fukushi, M., and Horiguchi, S. 2009. Fault-tolerant routing algorithm for network on chip without virtual channels. In Proceedings of the 24th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT'09). 313--321. Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. Furber, S. 2006. Living with failure: Lessons from nature? In Proceedings of the European Test Symposium (ETS'06). 4--8. Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. Gadlage, M., Ahlbin, J., Narasimham, B., Bhuva, B., Massengill, L., Reed, R., Schrimpf, R., and Vizkelethy, G. 2010. Scaling trends in set pulse widths in sub-100 nm bulk cmos processes. IEEE Trans. Nuclear Sci. 57, 6, 3336--3341.Google ScholarGoogle Scholar
  58. Ganguly, A., Pande, P. P., and Belzer, B. 2009. Crosstalk-aware channel coding schemes for energy efficient and reliable noc interconnects. IEEE Trans. Very Large Scale Inter Syst. 17, 11, 1626--1639. Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. Ganguly, A., Pande, P. P., Belzer, B., and Grecu, C. 2007. Addressing signal integrity in networks on chip interconnects through crosstalk-aware double error correction coding. In Proceedings of the IEEE Computer Society Annual Symposium on VLSI (ISVLSI'07). 317--324. Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. Gizopoulos, D., Psarakis, M., Adve, S. V., Ramachandran, P., Hari, S. K. S., Sorin, D., Biswas, A. M. A., and Vera, X. 2011. Architectures for online error detection and recovery in multicore processors. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE'11).Google ScholarGoogle Scholar
  61. Glass, C. J. and Ni, L. M. 1993. Fault-tolerant wormhole routing in meshes. In Proceedings of the 23rd International Fault-Tolerant Computing Digest of Papers Symposium (FTCS'93). 240--249.Google ScholarGoogle Scholar
  62. Grecu, C., Ivanov, A., Pande, R., Jantsch, A., Salminen, E., Ogras, U., and Marculescu, R. 2007. Towards open network-on-chip benchmarks. In Proceedings of the 1st International Symposium on Networks-on-Chip (NOCS'07). Google ScholarGoogle ScholarDigital LibraryDigital Library
  63. Grecu, C., Ivanov, A., Saleh, R., and Pande, P. P. 2006a. Noc interconnect yield improvement using crosspoint redundancy. In Proceedings of the 21st IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT'06). 457--465. Google ScholarGoogle ScholarDigital LibraryDigital Library
  64. Grecu, C., Ivanov, A., Saleh, R., Sogomonyan, E., and Pande, P. P. 2006b. On-line fault detection and location for noc interconnects. In Proceedings of the 12th IEEE International On-Line Testing Symposium (IOLTS'06). 145--150. Google ScholarGoogle ScholarDigital LibraryDigital Library
  65. Hazucha, P., Karnik, T., Maiz, J., Walstra, S., Bloechel, B., Tschanz, J., Dermer, G., Hareland, S., Armstrong, P., and Borkar, S. 2003. Neutron soft error rate measurements in a 90-nm cmos process and scaling trends in sram from 0.25-mu;m to 90-nm generation. In IEEE International Electron Devices Meeting Technical Digest (IEDM'03). 21.5.1--21.5.4.Google ScholarGoogle Scholar
  66. Hegde, R. and Shanbhag, N. 2000. Toward achieving energy efficiency in presence of deep submicron noise. IEEE Trans. Syst. 8, 4, 379--391. Google ScholarGoogle ScholarDigital LibraryDigital Library
  67. Hernandez, C., Federico, F., Santonja, V., and Duato, J. 2009. A new mechanism to deal with process variability in noc links. In Proceedings of the International Parallel and Distributed Processing Symposium (PDPS'09). 1--11. Google ScholarGoogle ScholarDigital LibraryDigital Library
  68. Hoskote, Y., Vangal, S., Singh, A., Borkar, N., and Borkar, S. 2007. A 5-ghz mesh interconnect for a teraflops processor. IEEE Micro 27, 5, 51--61. Google ScholarGoogle ScholarDigital LibraryDigital Library
  69. Hu, J. and Marculescu, R. 2004. Dyad - smart routing for networks-on-chip. In Proceedings of the 41st Design Automation Conference (DAC'04). 260--263. Google ScholarGoogle ScholarDigital LibraryDigital Library
  70. Huffman, W. C. and Pless, V. 2003. Fundamentals of Error-Correcting Codes. Cambridge University Press.Google ScholarGoogle Scholar
  71. INTEL LABS. 2010. The scc platform overview. Tech. rep. revision 0.7, Intel Corporation. http://www.intel.la/content/dam/www/public/us/en/documents/technology-briefs/intel-labs-single-chip-platform-overview-paper.pdf.Google ScholarGoogle Scholar
  72. ITRS. 2009. International technology roadmap for semiconductors. Tech. rep., ITRS Technology Working Group. http://www.itrs.net/Links/2009ITRS/2009Chapters_2009Tables/2009_Interconnect.pdf.Google ScholarGoogle Scholar
  73. Jantsch, A., Lauter, R., and Vitkowski, A. 2005. Power analysis of link level and end-to-end data protection in networks on chip. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS'05). Vol. 2. 1770--1773.Google ScholarGoogle Scholar
  74. Jovanovic, S., Tanougast, C., Weber, S., and Bobda, C. 2009. A new deadlock-free fault-tolerant routing algorithm for noc interconnections. In Proceedings of the International Conference on Field Programmable Logic (FPL'09). 326--331.Google ScholarGoogle Scholar
  75. Kakoee, M. R., Bertacco, V., and Benini, L. 2011a. A distributed and topology-agnostic approach for on-line noc testing. In Proceedings of the Network on Chip Symposium. Google ScholarGoogle ScholarDigital LibraryDigital Library
  76. Kakoee, M. R., Bertacco, V., and Benini, L. 2011b. Relinoc: A reliable network for priority-based on-chip communication. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE'11).Google ScholarGoogle Scholar
  77. Keane, J. and Kim, C. 2011. An odometer for cpus. IEEE Spectrum 48, 5, 26--31.Google ScholarGoogle ScholarCross RefCross Ref
  78. Keane, J., Kim, T.-H., and Kim, C. H. 2007. An on-chip nbti sensor for measuring pmos threshold voltage degradation. In Proceedings of the International Symposium on Low Power Electronics and Design. Google ScholarGoogle ScholarDigital LibraryDigital Library
  79. Kim, J., Nicopoulos, C., Park, D., Narayanan, V., Yousif, M. S., and Das, C. R. 2006. A gracefully degrading and energyefficient modular router architecture for on-chip networks. In Proceedings of the International Symposium on Computer Architecture (ISCA'06). 4--15. Google ScholarGoogle ScholarDigital LibraryDigital Library
  80. Kim, J., Park, D., Nicopoulos, C., Vijaykrishnan, N., and Das, C. 2005. Design and analysis of an noc architecture from performance, reliability and energy perspective. In Proceedings of the Symposium on Architecture for Networking and Communications Systems (ANCS'05). Google ScholarGoogle ScholarDigital LibraryDigital Library
  81. Kim, Y. B. and Kim, Y.-B. 2007. Fault tolerant source routing for network-on-chip. In Proceedings of the 22nd IEEE International Symposium on Defect and Fault-Tolerance in VLSI Systems (DFT'07). 12--20. Google ScholarGoogle ScholarDigital LibraryDigital Library
  82. Kohler, A. and Radetzki, M. 2009. Fault-tolerant architecture and deflection routing for degradable noc switches. In Proceedings of the 3rd ACM/IEEE International Symposium on Networks-on-Chips (NOCS'09). 22--31. Google ScholarGoogle ScholarDigital LibraryDigital Library
  83. Kohler, A., Schley, G., and Radetzki, M. 2010. Fault tolerant network on chip switching with graceful performance degradation. IEEE Trans. Comput.-Aid. Des. Integr. Circ. Syst. 20, 6, 883--896. Google ScholarGoogle ScholarDigital LibraryDigital Library
  84. Koibuchi, M., Matsutani, H., Amano, H., and Pinkston, T. M. 2008. A lightweight fault-tolerant mechanism for networkon-chip. In Proceedings of the 2nd ACM/IEEE International Symposium on Networks-on-Chip (NoCS'08). 13--22. Google ScholarGoogle ScholarDigital LibraryDigital Library
  85. Koupaei, F. K., Khademzadeh, A., and Janidarmian, M. 2011. Fault-tolerant application-specific network-on-chip. In Proceedings of the World Congress on Engineering and Computer Science.Google ScholarGoogle Scholar
  86. Kuhn, K., Kenyon, C., Kornfeld, A., Liu, M., Maheshwari, A., Kai Shih, W., Sivakumar, S., Taylor, G., Vandervoorn, P., and Zawadzki, K. 2008. Managing process variation in Intel's 45nm CMOS technology. Intel Technol. J. 12, 2.Google ScholarGoogle Scholar
  87. Lee, H., Chang, N., Ogras, U., and Marculescu, R. 2007. On-chip communication architecture exploration: A quantitative evaluation of point-to-point, bus, and network-on-chip approaches. ACM Trans. Des. Autom. Electron. Syst. 12, 3. Google ScholarGoogle ScholarDigital LibraryDigital Library
  88. Lehtonen, T., Liljeberg, P., and Plosila, J. 2007a. Analysis of forward error correction methods for nanoscale networks-onchip. In Proceedings of the 2nd International Conference on Nano-Networks (Nano-Net'07). Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering, 1--5. Google ScholarGoogle ScholarDigital LibraryDigital Library
  89. Lehtonen, T., Liljeberg, P., and Plosila, J. 2007b. Online reconfigurable self-timed links for fault tolerant noc. VLSI Des. 2007, 13.Google ScholarGoogle ScholarCross RefCross Ref
  90. Lehtonen, T., Wolpert, D., Liljeberg, P., Plosila, J., and Ampadu, P. 2010. Self-adaptive system for addressing permanent errors in on-chip interconnects. IEEE Trans. VLSI Syst. 18, 4, 527--540. Google ScholarGoogle ScholarDigital LibraryDigital Library
  91. Lin, S.-Y., Shen, W.-C., Hsu, C.-C., Chao, C.-H., and Wu, A.-Y. 2009. Fault-tolerant router with built-in self-test/self-diagnosis and fault-isolation circuits for 2d-mesh based chip multiprocessor systems. In Proceedings of the International Symposium on VLSI Design, Automation and Test (VLSI-DAT'09). 72--75.Google ScholarGoogle Scholar
  92. Lysne, O., Pinkston, T., and Duato, J. 2005. Part ii: A methodology for developing deadlock-free dynamic network reconfiguration processes. IEEE Trans. Parallel Distrib. Syst. 16, 5, 428--443. Google ScholarGoogle ScholarDigital LibraryDigital Library
  93. Majer, M., Bobda, C., Ahmadinia, A., and Teich, J. 2005. Packet routing in dynamically changing networks on chip. In Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium. 154b--154b. Google ScholarGoogle ScholarDigital LibraryDigital Library
  94. Malkin, G. and Steenstrup, M. 1995. Distance-vector routing. In Routing in Communication Networks, M. Steenstrup, Ed., Prentice Hall, 83--98. Google ScholarGoogle ScholarDigital LibraryDigital Library
  95. Marculescu, R., Ogras, U., Peh, L.-S., Jerger, N., and Hoskote, Y. 2009. Outstanding research problems in noc design: System, microarchitecture, and circuit perspectives. IEEE Trans. Comput. 28, 1, 3--21. Google ScholarGoogle ScholarDigital LibraryDigital Library
  96. Mcpherson, J. 2006. Reliability challenges for 45nm and beyond. In Proceedings of the 43rd ACM/IEEE Design Automation Conference (DAC'06). 176--181. Google ScholarGoogle ScholarDigital LibraryDigital Library
  97. Mediratta, S. D. and Draper, J. 2007. Performance evaluation of probe-send fault-tolerant network-on-chip router. In Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP'07). 69--75.Google ScholarGoogle Scholar
  98. Mejia, A., Flich, J., Duato, J., Reinemo, S.-A., and Skeie, T. 2006. Segment-based routing: An efficient fault-tolerant routing algorithm for meshes and tori. In Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS'06). Google ScholarGoogle ScholarDigital LibraryDigital Library
  99. Mejia, A., Palesi, M., Flich, J., Kumar, S., Lopez, P., Holsmark, R., and Duato, J. 2009. Region-based routing: A mechanism to support efficient routing algorithms in nocs. IEEE Trans. Syst. 17, 3, 356--369. Google ScholarGoogle ScholarDigital LibraryDigital Library
  100. Mintarno, E., Skaf, J., Zheng, R., Velamala, J. B., Cao, Y., Boyd, S., Dutton, R. W., and Mitra, S. 2011. Selftuning for maximized lifetime energy-efficiency in the presence of circuit aging. IEEE Trans. Comput.-Aid. Des. Integr. Circ. Syst. 30, 5, 760--773. Google ScholarGoogle ScholarDigital LibraryDigital Library
  101. Miranda, E. and Sune, J. 2004. Electron transport through broken down ultra-thin sio2 layers in mos devices. Microelectron. Reliabil. 44, 1, 1--23.Google ScholarGoogle ScholarCross RefCross Ref
  102. Mitra, S., Zhang, M., Waqas, S., Seifert, N., Gill, B., and Kim, K. S. 2006. Combinational logic soft error correction. In Proceedings of the IEEE International Test Conference (ITC'06). 1--9.Google ScholarGoogle Scholar
  103. Moy, J. 1995. Link-state routing. In Routing in Communication Networks, M. Ste, Ed., Prentice Hall, 135--157. Google ScholarGoogle ScholarDigital LibraryDigital Library
  104. Murali, S., Atienza, D., Benini, L., and De Micheli, G. 2006. A multi-path routing strategy with guaranteed in-order packet delivery and fault-tolerance for networks on chip. In Proceedings of the 43rd ACM/IEEE Design Automation Conference (DAC'06). 845--848. Google ScholarGoogle ScholarDigital LibraryDigital Library
  105. Murali, S., Theocharides, T., Vijaykrishnan, N., Irwin, M., Benini, L., and Demicheli, G. 2005. Analysis of error recovery schemes for networks on chips. IEEE Des. Test Comput. 22, 5, 434--442. Google ScholarGoogle ScholarDigital LibraryDigital Library
  106. Nicolaidis, M. 1999. Time redundancy based soft-error tolerance to rescue nanometer technologies. In Proceedings of the 17th IEEE VLSI Test Symposium. 86--94. Google ScholarGoogle ScholarDigital LibraryDigital Library
  107. Ogras, U., Hu, J., and Marculescu, R. 2005. Key research problems in noc design: A holistic perspective. In Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS'05). Google ScholarGoogle ScholarDigital LibraryDigital Library
  108. Owens, J., Dally, W., Ho, R., Jayasimha, D., Keckler, S., and Peh, L.-S. 2007. Research challenges for on-chip interconnection networks. IEEE Micro 27, 5, 96--108. Google ScholarGoogle ScholarDigital LibraryDigital Library
  109. Palesi, M., Kumar, S., and Catania, V. 2010. Leveraging partially faulty links usage for enhancing yield and performance in networks-on-chip. IEEE Trans. Comput.-Aid. Des. Integr. Circ. Syst. 29, 426--440. Google ScholarGoogle ScholarDigital LibraryDigital Library
  110. Pande, P. P., Ganguly, A., Feero, B., Belzer, B., and Grecu, C. 2006. Design of low power and reliable networks on chip through joint crosstalk avoidance and forward error correction coding. In Proceedings of the 21st IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT'06). 466--476. Google ScholarGoogle ScholarDigital LibraryDigital Library
  111. Parikh, R. and Bertacco, V. 2011. Formally enhanced runtime verification to ensure noc functional correctness. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'11). 410--419.
  112. Park, D., Nicopoulos, C., Kim, J., Vijaykrishnan, N., and Das, C. R. 2006. Exploring fault-tolerant network-on-chip architectures. In Proceedings of the International Conference on Dependable Systems and Networks (DSN'06). IEEE Computer Society, Los Alamitos, CA, 93--104.
  113. Patooghy, A. and Miremadi, S. G. 2008. Ltr: A low-overhead and reliable routing algorithm for network on chips. In Proceedings of the International SoC Design Conference (ISOCC'08). Vol. 1.
  114. Patooghy, A., Miremadi, S. G., and Shafaei, M. 2010. Crosstalk modeling to predict channel delay in network-on-chips. In Proceedings of the IEEE International Conference on Computer Design (ICCD'10). 396--401.
  115. Pirretti, M., Link, G. M., Brooks, R. R., Vijaykrishnan, N., Kandemir, M. T., and Irwin, M. J. 2004. Fault tolerant algorithms for network-on-chip interconnect. In Proceedings of the International Symposium on VLSI (ISVLSI'04). IEEE Computer Society, Los Alamitos, CA, 46--51.
  116. Puente, V., Gregorio, J. A., Vallejo, F., and Beivide, R. 2008. Immunet: Dependable routing for interconnection networks with arbitrary topology. IEEE Trans. Comput. 57, 12, 1676--1689.
  117. Radetzki, M. 2011. Fault-tolerant differential q routing in arbitrary noc topologies. In Proceedings of the International Conference on Embedded and Ubiquitous Computing (EUC'11). 33--40.
  118. Raik, J., Ubar, R., and Govind, V. 2007. Test configurations for diagnosing faulty links in noc switches. In Proceedings of the 12th IEEE European Test Symposium (ETS'07). 29--34.
  119. Rantala, V., Lehtonen, T., Liljeberg, P., and Plosila, J. 2009. Multi network interface architectures for fault tolerant network-on-chip. In Proceedings of the International Symposium on Signals, Circuits and Systems. 1--4.
  120. Ravindran, D. K. 2009. Structural fault-tolerance on the noc circuit level. Tech. rep., Institut fur Technische Informatik, Universitat Stuttgart. June.
  121. Rodrigo, S., Flich, J., Duato, J., and Hummel, M. 2008. Efficient unicast and multicast support for cmps. In Proceedings of the 41st IEEE/ACM International Symposium on Microarchitecture (MICRO'08). 364--375.
  122. Rodrigo, S., Flich, J., Roca, A., Medardoni, S., Bertozzi, D., Camacho, J., Silla, F., and Duato, J. 2010. Addressing manufacturing challenges with cost-efficient fault tolerant routing. In Proceedings of the 4th ACM/IEEE International Networks-on-Chip Symposium (NOCS'10). 25--32.
  123. Rossi, D., Angelini, P., and Metra, C. 2007. Configurable error control scheme for noc signal integrity. In Proceedings of the International On-Line Testing Symposium (IOLTS'07). 43--48.
  124. Saha, S. 2010. Modeling process variability in scaled cmos technology. IEEE Des. Test Comput. 27, 2, 8--16.
  125. Sanyo Semiconductors. 2011. Quality and reliability handbook ver 3. http://semicon.sanyo.com/en/reliability/.
  126. Schroeder, M. D., Birrell, A. D., Burrows, M., Murray, H., Needham, R. M., Rodeheffer, T. L., Satterthwaite, E. H., and Thacker, C. P. 1991. Autonet: A high-speed, self-configuring local area network using point-to-point links. IEEE J. Selected Areas Comm. 9, 8, 1318--1335.
  127. Schafer, M., Hollstein, T., Zimmer, H., and Glesner, M. 2005. Deadlock-free routing and component placement for irregular mesh-based networks-on-chip. In Proceedings of the International Conference on Computer Aided Design (ICCAD'05). 238--245.
  128. Schonwald, T., Zimmermann, J., Bringmann, O., and Rosenstiel, W. 2007. Fully adaptive fault-tolerant routing algorithm for network-on-chip architectures. In Proceedings of the 10th Euromicro Conference on Digital System Design Architectures, Methods and Tools (DSD'07). 527--534.
  129. Shamshiri, S., Ghofrani, A., and Cheng, K.-T. 2011. End-to-end error correction and online diagnosis for on-chip networks. In Proceedings of the International Test Conference.
  130. Shivakumar, P., Kistler, M., Keckler, S. W., Burger, D., and Alvisi, L. 2002. Modeling the effect of technology trends on the soft error rate of combinational logic. In Proceedings of the International Conference on Dependable Systems and Networks.
  131. Shooman, M. L. 2002. Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design. John Wiley & Sons.
  132. Song, W., Edwards, D., Nunez-Yanez, J., and Dasgupta, S. 2009. Adaptive stochastic routing in fault-tolerant on-chip networks. In Proceedings of the 3rd ACM/IEEE International Symposium on Networks-on-Chip (NoCS'09). 32--37.
  133. Sridhara, S. and Shanbhag, N. 2005. Coding for system-on-chip networks: A unified framework. IEEE Trans. VLSI Syst. 13, 6, 655--667.
  134. Strano, A., Bertozzi, D., Trivino, F., Sanchez, J. L., Alfaro, F. J., and Flich, J. 2012. Osr-lite: Fast and deadlock-free noc reconfiguration framework. In Proceedings of the International Conference on Embedded Computer Systems: Architectures, Modelling and Simulation.
  135. Takeda, E. and Yang, C. 1995. Hot-Carrier Effects in MOS Devices. Academic Press.
  136. Tamhankar, R., Murali, S., and De Micheli, G. 2005. Performance driven reliable link design for networks on chips. In Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC'05). Vol. 2. 749--754.
  137. Vangal, S., Howard, J., Ruhl, G., Dighe, S., Wilson, H., Tschanz, J., Finan, D., Iyer, P., Singh, A., Jacob, T., Jain, S., Venkataraman, S., Hoskote, Y., and Borkar, N. 2007. An 80-tile 1.28tflops network-on-chip in 65nm cmos. In Digest of Technical Papers of the IEEE International Solid-State Circuits Conference (ISSCC'07). 98--589.
  138. Viterbi, A. J. 1971. Convolutional codes and their performance in communication systems. IEEE Trans. Comm. Technol. 19, 5, 751--772.
  139. Vitkovski, A., Jantsch, A., Lauter, R., Haukilahti, R., and Nilsson, E. 2008. Low-power and error protection coding for network-on-chip traffic. IET Comput. Digital Techn. 2, 6, 483--492.
  140. Vitkovskiy, A., Soteriou, V., and Nicopoulos, C. 2010. A fine-grained link-level fault-tolerant mechanism for networks-on-chip. In Proceedings of the IEEE International Computer Design Conference (ICCD'10). 447--454.
  141. Walker, M. 2000. Modeling the wiring of deep submicron ics. IEEE Spectrum 37, 3, 65--71.
  142. Wittmann, R., Puchner, H., Hinh, L., Ceric, H., Gehring, A., and Selberherr, S. 2005. Simulation of dynamic nbti degradation for a 90nm cmos technology. In Proceedings of the Nanotechnology Conference.
  143. Wu, E., Lai, W., Nowak, E., Mckenna, J., Vayshenker, A., and Harmon, D. 2001. Interplay of voltage and temperature acceleration of oxide breakdown for ultra-thin oxides. Microelectron. Engin. 59, 25--31.
  144. Wu, J. and Wang, D. 2002. Fault-tolerant and deadlock-free routing in 2-d meshes using rectilinear-monotone polygonal fault blocks. In Proceedings of the International Conference on Parallel Processing. 247--254.
  145. Xinming, D. and Xuemei, S. 2010. Fault-tolerant routing in a prdt(2,1)-based noc. In Proceedings of the 2nd International Computer Engineering and Technology Conference (ICCET'10).
  146. Yaghini, P. M., Eghbal, A., Pedram, H., and Zarandi, H. R. 2011. Investigation of transient fault effects in synchronous and asynchronous network on chip router. J. Syst. Archit. 57, 1, 61--68.
  147. Yang, Y. 2010. Issues of esd protection in nano-scale cmos. Ph.D. thesis, George Mason University, Fairfax, Virginia, USA.
  148. Yu, A. J. and Lemieux, G. G. 2005. Fpga defect tolerance: Impact of granularity. In Proceedings of the IEEE International Conference on Field-Programmable Technology (FPT'05). 189--196.
  149. Yu, Q. and Ampadu, P. 2008. Adaptive error control for noc switch-to-switch links in a variable noise environment. In Proceedings of the IEEE International Symposium on Defect and Fault Tolerance of VLSI Systems (DFTVS'08). 352--360.
  150. Yu, Q. and Ampadu, P. 2010. Transient and permanent error co-management method for reliable networks-on-chip. In Proceedings of the 4th ACM/IEEE International Networks-on-Chip Symposium (NOCS'10). 145--154.
  151. Yu, Q. and Ampadu, P. 2011. A dual-layer method for transient and permanent error co-management in noc links. IEEE Trans. Circ. Syst. II: Express Briefs 58, 1, 36--40.
  152. Yu, Q. and Ampadu, P. 2012. Dual-layer adaptive error control for network-on-chip links. IEEE Trans. VLSI. Syst. 20, 7, 1304--1317.
  153. Yu, Q., Cano, J., Flich, J., and Ampadu, P. 2012. Transient and permanent error control for high-end multiprocessor systems-on-chip. In Proceedings of the 6th IEEE/ACM International Symposium on Networks on Chip (NoCS'12). 169--176.
  154. Yu, Q., Zhang, B., Li, Y., and Ampadu, P. 2010. Error control integration scheme for reliable noc. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS'10). 3893--3896.
  155. Yu, Q., Zhang, M., and Ampadu, P. 2011. Exploiting inherent information redundancy to manage transient errors in noc routing arbitration. In Proceedings of the IEEE Network on Chip Symposium (NoCS'11).
  156. Zhang, B. and Orshansky, M. 2008. Modeling of nbti-induced pmos degradation under arbitrary dynamic temperature variation. In Proceedings of the 9th International Symposium on Quality Electronic Design (ISQED'08). 774--779.
  157. Zhang, M. and Shanbhag, N. 2006. Soft-error-rate-analysis (sera) methodology. IEEE Trans. Comput.-Aid. Des. Integr. Circ. Syst. 25, 10, 2140--2155.
  158. Zhang, Y. and Jiang, J. 2008. Bibliographical review on reconfigurable fault-tolerant control systems. Ann. Rev. Control 32, 229--252.
  159. Zhang, Y., Li, H., and Li, X. 2009. Selected crosstalk avoidance code for reliable network-on-chip. J. Comput. Sci. Technol. 24, 6, 1074--1085.
  160. Zhang, Y., Parikh, D., Sankaranarayanan, K., Skadron, K., and Stan, M. 2003. Hotleakage: A temperature-aware model of subthreshold and gate leakage for architects. Tech. rep. CS-2003--05, University of Virginia, Department of Computer Science. March.
  161. Zhang, Z., Greiner, A., and Taktak, S. 2008. A reconfigurable routing algorithm for a fault-tolerant 2d-mesh network-on-chip. In Proceedings of the Design Automation Conference (DAC'08). 441--446.
  162. Zimmer, H. and Jantsch, A. 2003. A fault model notation and error-control scheme for switch-to-switch buses in a network-on-chip. In Proceedings of the 1st IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis. 188--193.

      • Published in

        ACM Computing Surveys  Volume 46, Issue 1
        October 2013
        551 pages
        ISSN:0360-0300
        EISSN:1557-7341
        DOI:10.1145/2522968
        Copyright © 2013 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 11 July 2013
        • Accepted: 1 January 2013
        • Revised: 1 September 2012
        • Received: 1 February 2012

        Qualifiers

        • research-article
        • Research
        • Refereed
