Cache-aware partitioning of multi-dimensional iteration spaces

Published: 4 May 2009

ABSTRACT

The need for high performance per watt has led to the development of multi-core systems such as the Intel Core 2 Duo processor and the Intel quad-core Kentsfield processor. Maximal exploitation of the hardware parallelism supported by such systems necessitates the development of concurrent software. This, in part, entails automatic parallelization of programs and efficient mapping of the parallelized program onto the different cores. The latter affects the load balance between the different cores, which in turn has a direct impact on performance. Given that parallel loops, such as the parallel DO loop in Fortran, account for a large percentage of the total execution time, we focus on the problem of how to efficiently partition the iteration space of (possibly) nested perfect/non-perfect parallel loops. In this regard, one of the key aspects is how to efficiently capture the cache behavior, as the cache subsystem is often the main performance bottleneck in multi-core systems. In this paper, we present a novel profile-guided compiler technique for cache-aware scheduling of the iteration spaces of such loops. Specifically, we propose a technique for iteration space scheduling that captures the effect of variation in the number of cache misses across the iteration space. Subsequently, we propose a general approach to capture the variation of both the number of cache misses and computation across the iteration space. We demonstrate the efficacy of our approach on a dedicated 4-way Intel® Xeon®-based multiprocessor using several kernels from the industry-standard SPEC CPU2000 and CPU2006 benchmarks, achieving speedups of up to 62.5%.
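To illustrate the idea of cache-aware partitioning described above, the sketch below shows a minimal, hypothetical version of cost-weighted iteration-space splitting (it is not the paper's algorithm): each iteration is assigned a profiled cost that combines computation and cache misses, and a greedy sweep cuts the 1-D iteration space into contiguous chunks of roughly equal total cost, so that a miss-heavy region receives fewer iterations than a cheap region. The function name, cost model, and greedy strategy are all illustrative assumptions.

```python
# Hedged sketch of cache-aware iteration-space partitioning.
# costs[i] models the per-iteration cost, e.g. compute[i] plus a
# miss penalty times the profiled cache misses of iteration i.

def partition(costs, n_parts):
    """Greedily split range(len(costs)) into at most n_parts
    contiguous chunks whose summed costs are roughly balanced.
    Returns a list of (start, end) half-open index ranges."""
    total = sum(costs)
    target = total / n_parts  # ideal cost per core
    chunks = []
    start, acc = 0, 0.0
    for i, c in enumerate(costs):
        acc += c
        # Close the current chunk once it reaches the target,
        # reserving the last chunk for the remaining iterations.
        if acc >= target and len(chunks) < n_parts - 1:
            chunks.append((start, i + 1))
            start, acc = i + 1, 0.0
    chunks.append((start, len(costs)))
    return chunks

# A miss-heavy tail (cost 5 per iteration) gets a shorter chunk
# than an equal-cost split by iteration count would give it.
print(partition([1, 1, 1, 1, 5, 5, 5, 5], 2))
```

A production scheduler would extend this to multi-dimensional spaces and correct for the greedy overshoot at chunk boundaries, but the sketch captures the core contrast with naive equal-iteration-count partitioning.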

Published in

SYSTOR '09: Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference
May 2009, 191 pages
ISBN: 9781605586236
DOI: 10.1145/1534530
Copyright © 2009 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Overall Acceptance Rate: 94 of 285 submissions, 33%