Quantifying loop nest locality using SPEC'95 and the Perfect Benchmarks

Authors:
Kathryn S. McKinley, Univ. of Massachusetts, Amherst
Olivier Temam, Paris XI Univ., Orsay, France

ACM Transactions on Computer Systems, Volume 17, Issue 4, pp. 288–336. https://doi.org/10.1145/329466.329484

Published: 01 November 1999

Abstract

This article analyzes and quantifies the locality characteristics of numerical loop nests in order to suggest future directions for architecture and software cache optimizations. Since most programs spend the majority of their time in nests, the vast majority of cache optimization techniques target loop nests. In contrast, the locality characteristics that drive these optimizations are usually collected across the entire application rather than at the nest level. Researchers have studied numerical codes for so long that a number of commonly held assertions have emerged on their locality characteristics. In light of these assertions, we use the SPEC'95 and Perfect Benchmarks to take a new look at measuring locality on numerical codes based on references, loop nests, and program locality properties. Our results show that several popular assertions are at best overstatements. For example, although most reuse is within a loop nest, in line with popular assertions, most misses are internest capacity misses, and they correspond to potential reuse between nearby loop nests. In addition, we find that temporal and spatial reuse have balanced roles within a loop nest and that most reuse across nests and the entire program is temporal. These results are consistent with high hit rates (80% or more hits), but go against the commonly held assumption that spatial reuse dominates. Our locality measurements reveal important differences between loop nests and programs, refute some popular assertions, and provide new insights for the compiler writer and the architect.
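The distinctions the abstract draws (temporal versus spatial reuse, and reuse within a nest versus between nests) can be made concrete with a toy trace-driven simulation. The following is a minimal sketch and not the authors' measurement methodology: it assumes a fully associative LRU cache (so every non-cold miss is a capacity miss), it calls a hit temporal when the same word was already touched since its line was loaded and spatial otherwise, and the names `classify_accesses` and `two_sweeps` are hypothetical.

```python
from collections import Counter, OrderedDict


def classify_accesses(trace, cache_lines, words_per_line):
    """Classify each access in `trace`, a list of (nest_id, word_address).

    Assumption: fully associative LRU cache, so there are no conflict misses.
    """
    cache = OrderedDict()  # resident line -> words touched since the line was loaded
    last_nest = {}         # line -> id of the nest that touched it most recently
    counts = Counter()
    for nest, addr in trace:
        line = addr // words_per_line
        if line in cache:
            cache.move_to_end(line)  # mark as most recently used
            kind = "temporal" if addr in cache[line] else "spatial"
            scope = "intra" if last_nest[line] == nest else "inter"
            counts[f"{scope}-nest {kind} hit"] += 1
            cache[line].add(addr)
        else:
            if line not in last_nest:
                counts["cold miss"] += 1
            else:
                # The line was used before and evicted: a capacity miss whose
                # lost reuse is with the nest that touched the line last.
                scope = "intra" if last_nest[line] == nest else "inter"
                counts[f"{scope}-nest capacity miss"] += 1
            cache[line] = {addr}
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
        last_nest[line] = nest
    return counts


def two_sweeps(n_words):
    """Two consecutive loop nests, each doing a[i] = a[i] + 1 over the array."""
    trace = []
    for nest in (1, 2):
        for i in range(n_words):
            trace.append((nest, i))  # read a[i]
            trace.append((nest, i))  # write a[i]
    return trace


# A 64-word array swept by two nests; the cache holds 8 lines of 4 words each,
# i.e. half of the array.
counts = classify_accesses(two_sweeps(64), cache_lines=8, words_per_line=4)
for category, n in sorted(counts.items()):
    print(f"{category}: {n}")
```

In this toy trace every hit is reuse inside a nest and the hit rate is 87.5%, yet every non-cold miss is a capacity miss on data that the previous nest had already brought into the cache. That is the kind of pattern the abstract describes, though real benchmarks are far less regular than this example.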


Published in

ACM Transactions on Computer Systems, Volume 17, Issue 4 (Nov. 1999), 123 pages
ISSN: 0734-2071
EISSN: 1557-7333
DOI: 10.1145/329466

Copyright © 1999 ACM

Publisher

Association for Computing Machinery, New York, NY, United States
