Spatial Memory Streaming

Authors:
Stephen Somogyi, Carnegie Mellon University
Thomas F. Wenisch, Carnegie Mellon University
Anastassia Ailamaki, Carnegie Mellon University
Babak Falsafi, Carnegie Mellon University
Andreas Moshovos, University of Toronto

ACM SIGARCH Computer Architecture News, Volume 34, Issue 2, May 2006, pp. 252–263. https://doi.org/10.1145/1150019.1136508

Published: 01 May 2006

Abstract

Prior research indicates that there is much spatial variation in applications' memory access patterns. Modern memory systems, however, use small fixed-size cache blocks and as such cannot exploit the variation. Increasing the block size would not only prohibitively increase pin and interconnect bandwidth demands, but also increase the likelihood of false sharing in shared-memory multiprocessors. In this paper, we show that memory accesses in commercial workloads often exhibit repetitive layouts that span large memory regions (e.g., several kB), and these accesses recur in patterns that are predictable through code-based correlation. We propose Spatial Memory Streaming, a practical on-chip hardware technique that identifies code-correlated spatial access patterns and streams predicted blocks to the primary cache ahead of demand misses. Using cycle-accurate full-system multiprocessor simulation of commercial and scientific applications, we demonstrate that Spatial Memory Streaming can on average predict 58% of L1 and 65% of off-chip misses, for a mean performance improvement of 37% and at best 307%.
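
To make the idea of code-correlated spatial patterns concrete, the following is a minimal software sketch, not the hardware design described in the paper. It assumes a 64-byte block, a 2 kB region, and a history table keyed by the program counter and block offset of the first (trigger) access to a region. These parameters, the class and method names, and the explicit `end_generation` call are all hypothetical choices made for illustration.

```python
# Illustrative sketch only: sizes, names, and the training policy are assumptions,
# not details taken from the abstract.
BLOCK_SIZE = 64          # bytes per cache block (assumed)
REGION_SIZE = 2048       # bytes per spatial region (assumed)
BLOCKS_PER_REGION = REGION_SIZE // BLOCK_SIZE


class SpatialPatternPredictor:
    """Toy code-correlated spatial pattern predictor (hypothetical design)."""

    def __init__(self):
        self.history = {}   # (pc, trigger block offset) -> bitmask of blocks touched
        self.active = {}    # region base -> [key, bitmask] for regions being recorded

    def access(self, pc, addr):
        """Record one access; return the block addresses predicted for streaming."""
        region = addr - addr % REGION_SIZE
        offset = (addr - region) // BLOCK_SIZE
        if region in self.active:
            # Region is already being recorded: add this block to its pattern.
            self.active[region][1] |= 1 << offset
            return []
        # Trigger access: start recording and look up a previously learned pattern.
        key = (pc, offset)
        self.active[region] = [key, 1 << offset]
        pattern = self.history.get(key, 0)
        return [region + i * BLOCK_SIZE
                for i in range(BLOCKS_PER_REGION)
                if (pattern >> i) & 1 and i != offset]

    def end_generation(self, region):
        """Stop recording a region (for example, on eviction) and store its pattern."""
        key, pattern = self.active.pop(region)
        self.history[key] = pattern


predictor = SpatialPatternPredictor()
for address in (0x1000, 0x1040, 0x1100):       # same code touches three blocks
    predictor.access(pc=0x400, addr=address)
predictor.end_generation(0x1000)
# The same instruction now touches a new region; the learned layout is replayed.
print([hex(a) for a in predictor.access(pc=0x400, addr=0x8000)])
# -> ['0x8040', '0x8100']
```

In this sketch the pattern learned in one region is applied to a region never seen before, which is what keying on code rather than on data addresses buys: the prediction depends only on which instruction triggered the access and where in the region it landed.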

Index Terms

Spatial Memory Streaming
1. Hardware

Published in
ACM SIGARCH Computer Architecture News, Volume 34, Issue 2
May 2006
383 pages
ISSN: 0163-5964
DOI: 10.1145/1150019
ISCA '06: Proceedings of the 33rd annual international symposium on Computer Architecture
June 2006
383 pages
ISBN: 076952608X
Copyright © 2006 Authors
Publisher
Association for Computing Machinery
New York, NY, United States