ABSTRACT
The memory bandwidth demands of modern microprocessors require the use of a multi-ported cache to achieve peak performance. However, multi-ported caches are costly to implement. In this paper, we propose techniques for improving the bandwidth of a single cache port by using additional buffering in the processor and by taking maximum advantage of a wider cache port. We evaluate these techniques using realistic applications that include the operating system. With these techniques, a single-ported cache achieves 91% of the performance of a dual-ported cache.
Increasing cache port efficiency for dynamic superscalar microprocessors
Special Issue: Proceedings of the 23rd annual international symposium on Computer architecture (ISCA '96)