Abstract
Optimizing compilers apply program transformations that reduce data movement to and from main memory by exploiting the data-cache hierarchy. However, because precise compile-time models of misses in hierarchical caches are lacking, compilers rely on very approximate cost models rather than attempting to minimize the number of cache misses directly. The current state of practice for cache miss analysis is accurate simulation, whose running time grows in proportion to the dataset/problem size and to the number of distinct cache configurations to be evaluated.
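To make the cost of simulation concrete, the following is a minimal sketch of a trace-driven simulator for a single set-associative LRU cache level. It is an illustration only, not the simulator or tool used in the paper; the function names `simulate_lru` and `sweep_trace`, the default cache geometry, and the two-sweep access pattern are all assumptions chosen for the example.

```python
from collections import OrderedDict


def simulate_lru(addresses, line_size=64, num_sets=64, ways=8):
    """Count misses of one set-associative LRU cache over a byte-address trace.

    The geometry defaults (64-byte lines, 64 sets, 8 ways) are illustrative
    assumptions, not parameters taken from the paper.
    """
    # One ordered dict per set; insertion order tracks recency (oldest first).
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_size
        cache_set = sets[line % num_sets]
        if line in cache_set:
            cache_set.move_to_end(line)        # hit: mark as most recently used
        else:
            misses += 1
            if len(cache_set) == ways:
                cache_set.popitem(last=False)  # evict the least recently used line
            cache_set[line] = True
    return misses


def sweep_trace(n, elem_size=8, repeats=2):
    """Yield byte addresses of `repeats` stride-1 sweeps over an n-element array."""
    for _ in range(repeats):
        for i in range(n):
            yield i * elem_size


if __name__ == "__main__":
    # The loop in simulate_lru runs once per access, so the work grows with
    # the problem size n and must be repeated for every cache geometry.
    for n in (1024, 8192):
        print(n, simulate_lru(sweep_trace(n)))
```

With these assumed parameters the cache holds 512 lines, so the 1024-element array (128 lines) is reused on the second sweep, while the 8192-element array (1024 lines) is evicted before reuse and misses on every line in both sweeps.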
This paper takes a fundamentally different approach by focusing on polyhedral programs with static control flow. Instead of relying on costly simulation, it develops a closed-form solution for modeling misses in a set-associative cache hierarchy. This solution can enable compile-time selection of program transformations to optimize cache misses. A tool implementing the approach has been developed and used to validate the framework.
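As a toy illustration of what "closed-form" means here, the sketch below counts the cold (first-touch) misses of a single stride-1 sweep over an array with one arithmetic expression and checks it against explicit enumeration. This is not the paper's model, which handles general affine programs and set-associative hierarchies; the function names and parameters are hypothetical, and the formula assumes the element size does not exceed the line size and that the base address is a multiple of the element size.

```python
def cold_misses_closed_form(n, elem_size=8, line_size=64, base=0):
    """Distinct cache lines touched by a stride-1 sweep, in constant time.

    Assumes elem_size <= line_size, so consecutive elements never skip a line.
    """
    if n <= 0:
        return 0
    first_line = base // line_size
    last_line = (base + (n - 1) * elem_size) // line_size
    return last_line - first_line + 1


def cold_misses_enumerated(n, elem_size=8, line_size=64, base=0):
    """Same count obtained by visiting every access, as a simulator would."""
    return len({(base + i * elem_size) // line_size for i in range(n)})


if __name__ == "__main__":
    for n, base in ((1024, 0), (10, 32), (1, 56)):
        closed = cold_misses_closed_form(n, base=base)
        assert closed == cold_misses_enumerated(n, base=base)
        print(n, base, closed)
```

The closed-form version costs the same for any `n`, whereas the enumeration visits every access; that difference in scaling is the motivation stated in the abstract.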
Index Terms
- Analytical modeling of cache behavior for affine programs