Abstract
Prior research indicates that there is much spatial variation in applications' memory access patterns. Modern memory systems, however, use small fixed-size cache blocks and therefore cannot exploit this variation. Increasing the block size would not only prohibitively increase pin and interconnect bandwidth demands, but also raise the likelihood of false sharing in shared-memory multiprocessors. In this paper, we show that memory accesses in commercial workloads often exhibit repetitive layouts that span large memory regions (e.g., several kB), and that these accesses recur in patterns that are predictable through code-based correlation. We propose Spatial Memory Streaming, a practical on-chip hardware technique that identifies code-correlated spatial access patterns and streams predicted blocks to the primary cache ahead of demand misses. Using cycle-accurate full-system multiprocessor simulation of commercial and scientific applications, we demonstrate that Spatial Memory Streaming can, on average, predict 58% of L1 misses and 65% of off-chip misses, for a mean performance improvement of 37% and a best-case improvement of 307%.
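To make the idea of code-correlated spatial patterns concrete, the following is a minimal software sketch, not the hardware design evaluated in the paper. It assumes a 2 kB region, 64-byte blocks, a lookup key formed from the triggering instruction's program counter and its block offset within the region, and an explicit `end_generation` call standing in for whatever event closes a region's recording interval. The class name, method names, and table organization are all hypothetical.

```python
REGION_BYTES = 2048   # assumed region size (the abstract says "several kB")
BLOCK_BYTES = 64      # assumed cache block size
BLOCKS_PER_REGION = REGION_BYTES // BLOCK_BYTES


class SpatialPatternPredictor:
    """Illustrative sketch: learn which blocks of a region get touched
    after a given (PC, offset) trigger, then replay that pattern."""

    def __init__(self):
        self.active = {}   # region base -> (trigger key, bitmask recorded so far)
        self.history = {}  # trigger key -> bitmask learned from a past region

    def access(self, pc, addr):
        """Record one access; return block addresses predicted for streaming."""
        base = addr - addr % REGION_BYTES
        offset = (addr - base) // BLOCK_BYTES
        if base in self.active:
            # Region is already being recorded: just set this block's bit.
            key, pattern = self.active[base]
            self.active[base] = (key, pattern | (1 << offset))
            return []
        # First (trigger) access to the region: start recording and predict.
        key = (pc, offset)
        self.active[base] = (key, 1 << offset)
        learned = self.history.get(key, 0)
        return [base + i * BLOCK_BYTES
                for i in range(BLOCKS_PER_REGION)
                if learned >> i & 1 and i != offset]

    def end_generation(self, base):
        """Stop recording a region and remember its pattern under its trigger key."""
        key, pattern = self.active.pop(base)
        self.history[key] = pattern


predictor = SpatialPatternPredictor()
predictor.access(0x400, 0x10000)        # trigger access, nothing learned yet
predictor.access(0x404, 0x10000 + 128)  # touches block 2 of the region
predictor.access(0x408, 0x10000 + 320)  # touches block 5 of the region
predictor.end_generation(0x10000)

# The same code (PC 0x400, offset 0) now touches a different region.
predicted = predictor.access(0x400, 0x20000)
print([hex(a) for a in predicted])  # ['0x20080', '0x20140']
```

Because the key depends on the code location rather than the data address, the pattern learned in one region transfers to a region the program has not yet visited, which is the property the abstract attributes to code-based correlation.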
Index Terms
- Spatial Memory Streaming
Spatial Memory Streaming
ISCA '06: Proceedings of the 33rd annual international symposium on Computer Architecture