Abstract
On modern computer systems, the memory performance of an application depends on its locality. For a single execution, locality-correlated measures like average miss rate or working-set size have long been analyzed using reuse distance—the number of distinct locations accessed between consecutive accesses to a given location. This article addresses the analysis problem at the program level, where the size of data and the locality of execution may change significantly depending on the input.
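The definition of reuse distance above can be made concrete with a short sketch. The Python below is a naive, quadratic-time illustration and not the article's measurement algorithm; the function names `reuse_distances` and `miss_rate` are hypothetical. The miss-rate helper relies on the standard property that, in a fully associative LRU cache holding `cache_size` locations, an access hits exactly when its reuse distance is smaller than `cache_size`.

```python
def reuse_distances(trace):
    """Return the reuse distance of every access in `trace`.

    The reuse distance is the number of distinct locations accessed between
    an access and the previous access to the same location. A first access
    has no previous access, so it is reported as None.
    """
    last_seen = {}   # location -> index of its most recent access
    distances = []
    for i, loc in enumerate(trace):
        if loc in last_seen:
            # Accesses strictly between the previous use and this one.
            between = trace[last_seen[loc] + 1 : i]
            distances.append(len(set(between)))
        else:
            distances.append(None)
        last_seen[loc] = i
    return distances


def miss_rate(distances, cache_size):
    """Miss rate of a fully associative LRU cache of `cache_size` locations.

    An access misses if it is a first access (None) or if at least
    `cache_size` distinct locations were touched since its last use.
    """
    misses = sum(1 for d in distances if d is None or d >= cache_size)
    return misses / len(distances)


if __name__ == "__main__":
    trace = list("abcab")
    dists = reuse_distances(trace)
    print(dists)                 # reuse distance of each access
    print(miss_rate(dists, 3))   # miss rate with room for three locations
```

For the trace `a b c a b`, the second access to `a` has distance 2 because `b` and `c` are accessed in between. Scanning back through the trace on every access is what makes this sketch slow, which is the cost that faster exact and approximate methods aim to avoid.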
The article presents two techniques that predict how the locality of a program changes with its input. The first is approximate reuse-distance measurement, which is asymptotically faster than exact methods while providing a guaranteed precision. The second is statistical prediction of locality in all executions of a program based on the analysis of a few executions. The prediction process has three steps: dividing data accesses into groups, finding the access patterns in each group, and building parameterized models. The resulting prediction may be used on-line with the help of distance-based sampling. When evaluated on fifteen benchmark applications, the new techniques predicted program locality with good accuracy, even for test executions that are orders of magnitude larger than the training executions.
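The third prediction step, building a parameterized model, can be illustrated with a minimal sketch. The abstract does not state the form of the model, so the code assumes, purely for illustration, that the average reuse distance `d` of one group of accesses grows linearly with the input data size `s`, that is, `d = c + e * s`. The function names and the training numbers are hypothetical and are not taken from the article.

```python
def fit_linear_pattern(s1, d1, s2, d2):
    """Fit d = c + e * s through two training runs.

    s1, s2: data sizes of the two training executions (must differ).
    d1, d2: average reuse distance of one access group in each execution.
    Returns the constant term c and the linear coefficient e.
    """
    if s1 == s2:
        raise ValueError("training runs need different data sizes")
    e = (d2 - d1) / (s2 - s1)
    c = d1 - e * s1
    return c, e


def predict_distance(c, e, s):
    """Predict the group's average reuse distance for data size s."""
    return c + e * s


if __name__ == "__main__":
    # Made-up measurements from two small training executions.
    c, e = fit_linear_pattern(1000, 520.0, 2000, 1020.0)
    # Extrapolate to an input two orders of magnitude larger.
    print(predict_distance(c, e, 100000))
```

Two runs are the minimum needed to determine the two parameters of a linear pattern. A group whose distance does not depend on the input would show up here as a coefficient `e` close to zero.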
The two techniques are among the first to enable quantitative analysis of whole-program locality in general sequential code. These findings form the basis for a unified understanding of program locality and its many facets. Concluding sections of the article present a taxonomy of related literature along five dimensions of locality and discuss the role of reuse distance in performance modeling, program optimization, cache and virtual memory management, and network traffic analysis.
Index Terms
- Program locality analysis using reuse distance