
Quantifying loop nest locality using SPEC'95 and the Perfect Benchmarks

Published: 01 November 1999

Abstract

This article analyzes and quantifies the locality characteristics of numerical loop nests in order to suggest future directions for architecture and software cache optimizations. Since most programs spend the majority of their time in nests, the vast majority of cache optimization techniques target loop nests. In contrast, the locality characteristics that drive these optimizations are usually collected across the entire application rather than at the nest level. Researchers have studied numerical codes for so long that a number of commonly held assertions have emerged on their locality characteristics. In light of these assertions, we use the SPEC'95 and Perfect Benchmarks to take a new look at measuring locality on numerical codes based on references, loop nests, and program locality properties. Our results show that several popular assertions are at best overstatements. For example, although most reuse is within a loop nest, in line with popular assertions, most misses are internest capacity misses, and they correspond to potential reuse between nearby loop nests. In addition, we find that temporal and spatial reuse have balanced roles within a loop nest and that most reuse across nests and the entire program is temporal. These results are consistent with high hit rates (80% or more hits), but go against the commonly held assumption that spatial reuse dominates. Our locality measurements reveal important differences between loop nests and programs, refute some popular assertions, and provide new insights for the compiler writer and the architect.
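The distinctions the abstract draws (temporal versus spatial reuse, and reuse within a nest versus between nests) can be illustrated with a small trace classifier. This is a minimal sketch under stated assumptions, not the article's measurement methodology: the function name `classify_reuse`, the `(nest_id, address)` trace format, and the 32-byte line size are all hypothetical choices made for illustration. A reference counts as temporal reuse if the same address was touched before, as spatial reuse if only its cache line was touched before, and as intra- or inter-nest depending on which nest made the previous touch.

```python
from collections import Counter

LINE_SIZE = 32  # bytes per cache line (assumed value, not taken from the article)


def classify_reuse(trace, line_size=LINE_SIZE):
    """Count references by reuse kind and scope.

    trace: iterable of (nest_id, byte_address) pairs in program order.
    Returns a Counter keyed by "cold" or by (kind, scope), where kind is
    "temporal" or "spatial" and scope is "intra" or "inter" (nest).
    """
    last_nest_for_addr = {}  # address -> nest that last touched it
    last_nest_for_line = {}  # cache line -> nest that last touched it
    counts = Counter()
    for nest_id, addr in trace:
        line = addr // line_size
        if addr in last_nest_for_addr:
            # Same address seen before: temporal reuse.
            kind, prev_nest = "temporal", last_nest_for_addr[addr]
        elif line in last_nest_for_line:
            # New address on a previously touched line: spatial reuse.
            kind, prev_nest = "spatial", last_nest_for_line[line]
        else:
            kind, prev_nest = "cold", None
        if kind == "cold":
            counts["cold"] += 1
        else:
            scope = "intra" if prev_nest == nest_id else "inter"
            counts[(kind, scope)] += 1
        last_nest_for_addr[addr] = nest_id
        last_nest_for_line[line] = nest_id
    return counts


# Hypothetical trace: two loop nests, each sweeping the same eight
# 8-byte array elements (byte addresses 0, 8, ..., 56).
trace = [(0, 8 * i) for i in range(8)] + [(1, 8 * i) for i in range(8)]
counts = classify_reuse(trace)
```

In this toy trace the first nest sees only spatial reuse, while every reference in the second nest is temporal reuse of data first touched by the first nest. Whether such inter-nest reuse becomes a hit or a capacity miss depends on the cache size relative to the data touched in between, which this sketch does not model.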

  • Published in

    ACM Transactions on Computer Systems, Volume 17, Issue 4
    Nov. 1999
    123 pages
    ISSN:0734-2071
    EISSN:1557-7333
    DOI:10.1145/329466
    Copyright © 1999 ACM

    Publisher

    Association for Computing Machinery

    New York, NY, United States
