ABSTRACT
The need for high performance per watt has led to the development of multi-core systems such as the Intel Core 2 Duo processor and the quad-core Intel Kentsfield processor. Fully exploiting the hardware parallelism of such systems requires concurrent software. This entails, in part, automatic parallelization of programs and efficient mapping of the parallelized program onto the different cores. The latter determines the load balance between the cores, which in turn has a direct impact on performance. Because parallel loops, such as a parallel DO loop in Fortran, account for a large percentage of total execution time, we focus on the problem of efficiently partitioning the iteration space of (possibly) nested perfect/non-perfect parallel loops. A key aspect of this problem is efficiently capturing cache behavior, since the cache subsystem is often the main performance bottleneck in multi-core systems. In this paper, we present a novel profile-guided compiler technique for cache-aware scheduling of the iteration spaces of such loops. Specifically, we propose an iteration space scheduling technique that captures the variation in the number of cache misses across the iteration space. We then generalize the approach to capture the variation of both cache misses and computation across the iteration space. We demonstrate the efficacy of our approach on a dedicated 4-way Intel® Xeon® based multiprocessor using several kernels from the industry-standard SPEC CPU2000 and CPU2006 benchmarks, achieving speedups of up to 62.5%.
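To illustrate the core idea of profile-guided partitioning, the following is a minimal sketch (not the paper's algorithm) of how a scheduler might cut a one-dimensional iteration space into contiguous chunks of near-equal *cost* rather than near-equal *length*, given hypothetical per-iteration weights obtained from profiling (e.g., cycles including cache-miss penalties). All names here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: profile-guided partitioning of a 1-D iteration space.
# Given a per-iteration weight (e.g., profiled cycles including cache-miss
# penalties), split the space into num_cores contiguous chunks whose total
# weights are as even as possible, instead of num_cores equal-length chunks.

from itertools import accumulate

def partition_iterations(weights, num_cores):
    """Return half-open (start, end) iteration ranges, one per core."""
    prefix = [0] + list(accumulate(weights))  # prefix[i] = cost of iterations [0, i)
    total = prefix[-1]
    bounds, start = [], 0
    for p in range(1, num_cores):
        target = total * p / num_cores        # ideal cumulative cost at this cut
        cut = start
        # advance to the first index whose cumulative cost reaches the target
        while cut < len(weights) and prefix[cut] < target:
            cut += 1
        bounds.append((start, cut))
        start = cut
    bounds.append((start, len(weights)))
    return bounds

# Later iterations are "heavier" here (e.g., they incur more cache misses),
# so the first chunk gets more iterations than the second.
weights = [1, 1, 1, 1, 4, 4, 4, 4]
print(partition_iterations(weights, 2))  # -> [(0, 6), (6, 8)]
```

With uniform weights this degenerates to the usual equal-length static schedule; the benefit appears only when profiling reveals non-uniform cost across the iteration space, which is the situation the paper targets.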
- K. Olukotun and L. Hammond. The future of microprocessors. ACM Queue, 3(7):26--29, 2005.
- Teraflops Research Chip. http://www.intel.com/research/platform/terascale/teraflops.htm.
- H. Sutter and J. Larus. Software and the concurrency revolution. ACM Queue, 3(7), 2005.
- S. F. Lundstrom and G. H. Barnes. A controllable MIMD architecture. In Proceedings of the 1980 International Conference on Parallel Processing, pages 19--27, St. Charles, IL, August 1980.
- SPEC CFP2000. http://www.spec.org/cpu2000/CFP2000.
- M. R. Haghighat and C. D. Polychronopoulos. Symbolic analysis for parallelizing compilers. ACM Transactions on Programming Languages and Systems, 18(4):477--518, July 1996.
- R. Sakellariou. On the Quest for Perfect Load Balance in Loop-Based Parallel Computations. PhD thesis, Department of Computer Science, University of Manchester, October 1996.
- C. Polychronopoulos, D. J. Kuck, and D. A. Padua. Execution of parallel loops on parallel processor systems. In Proceedings of the 1986 International Conference on Parallel Processing, pages 519--527, August 1986.
- E. H. D'Hollander. Partitioning and labeling of loops by unimodular transformations. IEEE Transactions on Parallel and Distributed Systems, 3(4):465--476, 1992.
- A. Kejariwal, H. Saito, X. Tian, M. Girkar, U. Banerjee, A. Nicolau, and C. D. Polychronopoulos. A general approach for partitioning n-dimensional parallel nested loops with conditionals. In Proceedings of the 18th Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 49--58, Cambridge, MA, 2006.
- A. Kejariwal, A. Nicolau, U. Banerjee, and C. D. Polychronopoulos. A novel approach for partitioning iteration spaces with variable densities. In Proceedings of the 10th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 120--131, Chicago, IL, 2005.
- A. Kejariwal, P. D'Alberto, A. Nicolau, and C. D. Polychronopoulos. A geometric approach for partitioning N-dimensional non-rectangular iteration spaces. In Proceedings of the 17th International Workshop on Languages and Compilers for Parallel Computing, pages 102--116, West Lafayette, IN, 2004.
- M. J. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, Redwood City, CA, 1996.
- Z. Guz, I. Keidar, A. Kolodny, and U. Weiser. Nahalal: Cache organization for chip multiprocessors. IEEE Computer Architecture Letters, 6(1), 2007.
- S. Ghosh, M. Martonosi, and S. Malik. Cache miss equations: An analytical representation of cache misses. In Proceedings of the 11th ACM International Conference on Supercomputing, pages 317--324, Vienna, Austria, July 1997.
- J. S. Harper, D. J. Kerbyson, and G. R. Nudd. Analytical modeling of set-associative cache behavior. IEEE Transactions on Computers, 48(10):1009--1024, 1999.
- S. Chatterjee, E. Parker, P. J. Hanlon, and A. R. Lebeck. Exact analysis of the cache behavior of nested loops. In Proceedings of the SIGPLAN '01 Conference on Programming Language Design and Implementation, pages 286--297, Snowbird, UT, 2001.
- B. B. Fraguela, R. Doallo, J. Touriño, and E. L. Zapata. A compiler tool to predict memory hierarchy performance of scientific codes. Parallel Computing, 30(2):225--248, 2004.
- C. Polychronopoulos. Loop coalescing: A compiler transformation for parallel machines. In Proceedings of the 1987 International Conference on Parallel Processing, pages 235--242, August 1987.
- SPEC CINT2006. http://www.spec.org/cpu2006/CINT2006.
- SPEC CFP2006. http://www.spec.org/cpu2006/CFP2006.
- Intel® VTune™ Performance Analyzer 8.0.1 for Windows. http://www.intel.com/cd/software/products/asmo-na/eng/vtune/219898.htm.
- J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, San Mateo, CA, 1990.
- OpenMP Specification, version 2.5. http://www.openmp.org/drupal/mp-documents/spec25.pdf.
- Z. Li. Array privatization for parallel execution of loops. In Proceedings of the 1992 ACM International Conference on Supercomputing, pages 313--322, Washington, D.C., 1992.
- E. W. Weisstein. Abel's impossibility theorem. From MathWorld--A Wolfram Web Resource. http://mathworld.wolfram.com/AbelsImpossibilityTheorem.html.
- D. A. Padua and M. J. Wolfe. Advanced compiler optimizations for supercomputers. Communications of the ACM, 29(12):1184--1201, December 1986.
- M. J. Wolfe. Iteration space tiling for memory hierarchies, December 1987.
- M. E. Wolf and M. Lam. A data locality optimizing algorithm. In Proceedings of the SIGPLAN '91 Conference on Programming Language Design and Implementation, Toronto, Canada, June 1991.
- M. E. Wolf and M. Lam. A loop transformation theory and an algorithm to maximize parallelism. IEEE Transactions on Parallel and Distributed Systems, 2(4):452--471, October 1991.
- M. E. Wolf, D. E. Maydan, and D.-K. Chen. Combining loop transformations considering caches and scheduling. In Proceedings of the 29th International Symposium on Microarchitecture (MICRO-29), pages 274--286, Paris, France, 1996.
- T. Ball and J. Larus. Branch prediction for free. In Proceedings of the SIGPLAN '93 Conference on Programming Language Design and Implementation, pages 300--313, Albuquerque, NM, June 1993.
- A. Krall. Improving semi-static branch prediction by code replication. In Proceedings of the SIGPLAN '94 Conference on Programming Language Design and Implementation, pages 97--106, Orlando, FL, 1994.
- Intel® Compilers for Linux. http://www.intel.com/cd/software/products/asmo-na/eng/compilers/284264.htm.
- D. Kuck, A. H. Sameh, R. Cytron, A. Veidenbaum, C. D. Polychronopoulos, G. Lee, T. McDaniel, B. R. Leasure, C. Beckman, J. R. B. Davies, and C. P. Kruskal. The effects of program restructuring, algorithm change and architecture choice on program performance. In Proceedings of the 1984 International Conference on Parallel Processing, pages 129--138, August 1984.
- M. J. Wolfe. Optimizing Supercompilers for Supercomputers. The MIT Press, Cambridge, MA, 1989.
- D. Kulkarni, K. Kumar, A. Basu, and A. Paulraj. Loop partitioning for distributed memory multiprocessors as unimodular transformations. In Proceedings of the 1991 ACM International Conference on Supercomputing, Cologne, Germany, June 1991.
- M. O'Boyle and G. A. Hedayat. Program and data transformations for efficient execution on distributed memory architectures. Technical Report UMCS-93-1-6, Department of Computer Science, University of Manchester, 1992.
- J. Sheu and T. Thai. Partitioning and mapping nested for-loops on multiprocessor systems. In Proceedings of the 1991 ACM International Conference on Supercomputing, Cologne, Germany, June 1991.
- J.-P. Sheu and T.-S. Chen. Partitioning and mapping of nested loops for linear array multicomputers. Journal of Supercomputing, 9(1--2):183--202, 1995.
- I. Drositis, G. Goumas, N. Koziris, P. Tsanakas, and G. Papakonstantinou. Evaluation of loop grouping methods based on orthogonal projection spaces. In Proceedings of the 2000 International Conference on Parallel Processing, pages 469--476, August 2000.
- A. Asthana, H. V. Jagadish, J. A. Chandross, D. Lin, and S. C. Knauer. An intelligent memory system. SIGARCH Computer Architecture News, 16(4):12--20, 1988.
- J. P. Moskowitz and C. Jousselin. An algebraic memory model. SIGARCH Computer Architecture News, 17(1):55--62, 1989.
- E. Pegg, T. Rowland, and E. W. Weisstein. Cayley graph. From MathWorld--A Wolfram Web Resource. http://mathworld.wolfram.com/CayleyGraph.html.
- J. Ferrante, V. Sarkar, and W. Thrash. On estimating and enhancing cache effectiveness. In U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, editors, Languages and Compilers for Parallel Computing, Fourth International Workshop, Santa Clara, CA, August 1991. Springer-Verlag.
- O. Temam, C. Fricker, and W. Jalby. Cache interference phenomena. In Proceedings of the 1994 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 261--271, Nashville, TN, 1994.
- N. Bermudo, X. Vera, A. González, and J. Llosa. Optimizing cache miss equations polyhedra. SIGARCH Computer Architecture News, 28(1):43--52, 2000.
- G. Ammons, T. Ball, and J. R. Larus. Exploiting hardware performance counters with flow and context sensitive profiling. In Proceedings of the SIGPLAN '97 Conference on Programming Language Design and Implementation, pages 85--96, 1997.
- F. Schneider and T. Gross. Using platform-specific performance counters for dynamic compilation. In Proceedings of the 18th International Workshop on Languages and Compilers for Parallel Computing, Hawthorne, NY, October 2005.
- F. Schneider, M. Payer, and T. Gross. Online optimizations driven by hardware performance monitoring. In Proceedings of the SIGPLAN '07 Conference on Programming Language Design and Implementation, 2007.
- V. Sarkar and B. Simons. Parallel program graphs and their classification. In Proceedings of the Sixth Workshop on Languages and Compilers for Parallel Computing, Portland, OR, August 1993.