Predicting whole-program locality through reuse distance analysis

Published: 09 May 2003
Abstract

Profiling can accurately analyze program behavior for selected data inputs. We show that profiling can also predict program locality for inputs other than the profiled ones. Here locality is defined by the distance of data reuse. Studying whole-program data reuse may reveal global patterns that are not apparent in short-distance reuses or local control flow. However, the analysis must meet two requirements to be useful. The first is efficiency: it needs to analyze all accesses to all data elements in full-size benchmarks and to measure distances of any length and at any required precision. The second is prediction: based on a few training runs, it needs to classify patterns as regular or irregular and, for regular ones, predict their (changing) behavior for other inputs. In this paper, we show that these goals are attainable through three techniques: approximate analysis of reuse distance (originally called LRU stack distance), pattern recognition, and distance-based sampling. When tested on 15 integer and floating-point programs from SPEC and other benchmark suites, our techniques predict with 94% accuracy on average for data inputs up to hundreds of times larger than the training inputs. Based on these results, the paper discusses possible uses of this analysis.
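To make the central metric concrete, the sketch below computes exact reuse distance (LRU stack distance) for a short access trace: for each access, it counts the distinct elements touched since the previous access to the same element. This is a naive list-based illustration, not the approximate analysis the paper describes; the function names `reuse_distances` and `lru_misses` and the sample trace are assumptions introduced here for illustration only.

```python
import math


def reuse_distances(trace):
    """Return the reuse distance of every access in `trace`.

    The reuse distance is the number of distinct elements accessed since
    the previous access to the same element. A first access has no
    previous use, so it is reported as math.inf.
    Naive exact version: O(N * M) for N accesses and M distinct elements.
    """
    stack = []      # LRU stack: the most recently used element is last
    distances = []
    for element in trace:
        if element in stack:
            position = stack.index(element)
            # Everything above `position` was touched since the last use.
            distances.append(len(stack) - position - 1)
            stack.pop(position)
        else:
            distances.append(math.inf)
        stack.append(element)
    return distances


def lru_misses(trace, capacity):
    """Count misses in a fully associative LRU cache of `capacity` elements.

    An access hits exactly when its reuse distance is below the capacity.
    """
    return sum(1 for d in reuse_distances(trace) if d >= capacity)


if __name__ == "__main__":
    trace = "abcaacb"  # hypothetical trace; each letter is one data element
    print(reuse_distances(trace))
    print(lru_misses(trace, capacity=2))
```

Because a single pass yields the distance of every access, the same histogram of distances gives the miss count for any fully associative LRU cache size, which is one reason the metric is attractive for whole-program locality analysis.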

    • Published in

      ACM SIGPLAN Notices, Volume 38, Issue 5
      May 2003
      349 pages
      ISSN: 0362-1340
      EISSN: 1558-1160
      DOI: 10.1145/780822

      PLDI '03: Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation
      June 2003
      360 pages
      ISBN: 1581136625
      DOI: 10.1145/781131

      Copyright © 2003 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery, New York, NY, United States
