Abstract
Three-dimensional integration enables stacking memory directly on top of a microprocessor, thereby significantly reducing wire delay between the two. Previous studies have examined the performance benefits of such an approach, but all of them consider only commodity 2D DRAM organizations. In this work, we explore more aggressive 3D DRAM organizations that make better use of the additional die-to-die bandwidth provided by 3D stacking, as well as the additional transistor count. Our simulation results show that with a few simple changes to the 3D-DRAM organization, we can achieve a 1.75x speedup over previously proposed 3D-DRAM approaches on memory-intensive multi-programmed workloads running on a quad-core processor. This significant increase in memory system performance makes the L2 miss handling architecture (MHA) a new bottleneck, which we address by combining a novel data structure called the Vector Bloom Filter with dynamic tuning of miss status handling register (MSHR) capacity. Our scalable L2 MHA yields an additional 17.8% performance improvement over our 3D-stacked memory architecture.
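To see how the two reported gains relate, the following minimal sketch compounds them into a single figure relative to previously proposed 3D-DRAM approaches. It assumes both numbers are measured on the same workloads and that the 17.8% improvement applies multiplicatively on top of the 1.75x speedup; the abstract does not state this explicitly, and the function name `compound_speedup` is purely illustrative.

```python
def compound_speedup(base_speedup: float, extra_improvement_pct: float) -> float:
    """Combine a speedup factor with a further percentage improvement.

    Assumption (not stated in the abstract): the percentage improvement is
    measured relative to the configuration that already has `base_speedup`,
    so the two gains multiply.
    """
    return base_speedup * (1.0 + extra_improvement_pct / 100.0)


# Figures quoted in the abstract.
ORGANIZATION_SPEEDUP = 1.75   # new 3D-DRAM organization vs. prior 3D-DRAM approaches
MHA_IMPROVEMENT_PCT = 17.8    # scalable L2 MHA on top of the new organization

if __name__ == "__main__":
    total = compound_speedup(ORGANIZATION_SPEEDUP, MHA_IMPROVEMENT_PCT)
    print(f"Combined speedup under the stated assumption: {total:.2f}x")
```

Under that assumption the combined gain is roughly 2.06x; if the two results were obtained on different workload mixes, the figures cannot be multiplied this way.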
3D-Stacked Memory Architectures for Multi-core Processors
ISCA '08: Proceedings of the 35th Annual International Symposium on Computer Architecture