3D-Stacked Memory Architectures for Multi-core Processors

Author:
Gabriel H. Loh

ACM SIGARCH Computer Architecture News, Volume 36, Issue 3, June 2008, pp. 453–464. https://doi.org/10.1145/1394608.1382159

Published: 01 June 2008

Abstract

Three-dimensional integration enables stacking memory directly on top of a microprocessor, thereby significantly reducing wire delay between the two. Previous studies have examined the performance benefits of such an approach, but all of these works only consider commodity 2D DRAM organizations. In this work, we explore more aggressive 3D DRAM organizations that make better use of the additional die-to-die bandwidth provided by 3D stacking, as well as the additional transistor count. Our simulation results show that with a few simple changes to the 3D-DRAM organization, we can achieve a 1.75x speedup over previously proposed 3D-DRAM approaches on our memory-intensive multi-programmed workloads on a quad-core processor. The significant increase in memory system performance makes the L2 miss handling architecture (MHA) a new bottleneck, which we address by combining a novel data structure called the Vector Bloom Filter with dynamic MSHR capacity tuning. Our scalable L2 MHA yields an additional 17.8% performance improvement over our 3D-stacked memory architecture.
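To put the two headline numbers on a common scale, the short sketch below composes the 1.75x speedup with the additional 17.8% improvement. It assumes that both gains are measured on the same workloads with the same metric and that they compose multiplicatively; the abstract does not state this, so the result is an estimate, not a figure reported by the paper. The function and variable names are illustrative.

```python
def combined_speedup(base_speedup: float, extra_improvement_pct: float) -> float:
    """Compose a speedup factor with a further percentage improvement.

    Assumption (not stated in the abstract): the second improvement is
    measured relative to the already-accelerated design, so the two
    factors multiply.
    """
    return base_speedup * (1.0 + extra_improvement_pct / 100.0)


# Values quoted in the abstract.
memory_speedup = 1.75        # new 3D-DRAM organization vs. prior 3D-DRAM
mha_improvement_pct = 17.8   # scalable L2 MHA on top of that organization

total = combined_speedup(memory_speedup, mha_improvement_pct)
print(f"Estimated combined speedup: {total:.2f}x")
```

Under these assumptions the estimate is roughly 2.06x over the previously proposed 3D-DRAM approaches.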

Index Terms

3D-Stacked Memory Architectures for Multi-core Processors
1. Hardware
  1. Hardware validation
  2. Integrated circuits
    1. Semiconductor memory
      1. Dynamic memory

Published in
ACM SIGARCH Computer Architecture News Volume 36, Issue 3
June 2008
449 pages
ISSN:0163-5964
DOI:10.1145/1394608
ISCA '08: Proceedings of the 35th Annual International Symposium on Computer Architecture
June 2008
449 pages
ISBN:9780769531748
Copyright © 2008 Author
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 June 2008
Author Tags
3D integration
memory
multi-core
Qualifiers
- article