Design and Management of 3D Chip Multiprocessors Using Network-in-Memory

Authors:
Feihui Li, Chrysostomos Nicopoulos, Thomas Richardson, Yuan Xie, Vijaykrishnan Narayanan, Mahmut Kandemir
Pennsylvania State University

ACM SIGARCH Computer Architecture News, Volume 34, Issue 2, May 2006, pp. 130–141. https://doi.org/10.1145/1150019.1136497

Published: 01 May 2006

Abstract

Long interconnects are becoming an increasingly important problem from both power and performance perspectives. This motivates designers to adopt on-chip network-based communication infrastructures and three-dimensional (3D) designs in which multiple device layers are stacked together. Given the current trend towards increasing use of chip multiprocessing, it is timely to consider 3D chip multiprocessor design and memory networking issues, especially in the context of data management in large L2 caches. The overall goal of this paper is to study the challenges for L2 design and management in 3D chip multiprocessors. Our first contribution is to propose a router architecture and a topology design that make use of a network architecture embedded into the L2 cache memory. Our second contribution is to demonstrate, through extensive experiments, that a 3D L2 memory architecture generates much better results than conventional two-dimensional (2D) designs under different numbers of layers and vertical (inter-wafer) connections. In particular, our experiments show that a 3D architecture with no dynamic data migration generates better performance than a 2D architecture that employs data migration. This also helps reduce power consumption in L2 because fewer data movements are needed.
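As a rough intuition for why stacking layers shortens communication paths, the sketch below compares the average hop count of a single-layer mesh with that of a stacked mesh holding the same number of nodes. It is an illustration only, not the router or topology proposed in the paper: it assumes a plain mesh in every dimension, one hop per vertical link, and uniformly distributed source and destination nodes. The helper name `mean_hops` is hypothetical.

```python
from itertools import product


def mean_hops(dims):
    """Mean Manhattan distance over all ordered node pairs of a mesh.

    `dims` gives the mesh size in each dimension, e.g. (8, 8) or (4, 4, 4).
    Assumption: every link, horizontal or vertical, costs exactly one hop.
    """
    nodes = list(product(*(range(d) for d in dims)))
    total = sum(
        sum(abs(a - b) for a, b in zip(src, dst))
        for src in nodes
        for dst in nodes
    )
    return total / (len(nodes) ** 2)


flat = mean_hops((8, 8))        # 64 nodes on one layer
stacked = mean_hops((4, 4, 4))  # the same 64 nodes spread over four layers
print(f"2D 8x8 mesh:   {flat:.3f} hops on average")
print(f"3D 4x4x4 mesh: {stacked:.3f} hops on average")
```

Under these assumptions the stacked layout averages 3.75 hops against 5.25 for the flat one. Real designs differ in link latency, vertical-connection count, and traffic pattern, so the figures indicate the direction of the effect rather than its size.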

Index Terms

Hardware


Published in
ACM SIGARCH Computer Architecture News Volume 34, Issue 2
May 2006
383 pages
ISSN:0163-5964
DOI:10.1145/1150019
ISCA '06: Proceedings of the 33rd annual international symposium on Computer Architecture
June 2006
383 pages
ISBN:076952608X
Copyright © 2006 Authors
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 May 2006
Qualifiers
- article