Abstract
This article analyzes and quantifies the locality characteristics of numerical loop nests in order to suggest future directions for architecture and software cache optimizations. Since most programs spend the majority of their time in nests, the vast majority of cache optimization techniques target loop nests. In contrast, the locality characteristics that drive these optimizations are usually collected across the entire application rather than at the nest level. Researchers have studied numerical codes for so long that a number of commonly held assertions have emerged on their locality characteristics. In light of these assertions, we use the SPEC'95 and Perfect Benchmarks to take a new look at measuring locality on numerical codes based on references, loop nests, and program locality properties. Our results show that several popular assertions are at best overstatements. For example, although most reuse is within a loop nest, in line with popular assertions, most misses are internest capacity misses, and they correspond to potential reuse between nearby loop nests. In addition, we find that temporal and spatial reuse have balanced roles within a loop nest and that most reuse across nests and the entire program is temporal. These results are consistent with high hit rates (80% or more hits), but go against the commonly held assumption that spatial reuse dominates. Our locality measurements reveal important differences between loop nests and programs, refute some popular assertions, and provide new insights for the compiler writer and the architect.