ABSTRACT
The need for high performance per watt has led to the development of multi-core systems such as the Intel Core 2 Duo processor and the quad-core Intel Kentsfield processor. Fully exploiting the hardware parallelism of such systems requires concurrent software. This entails, in part, automatic parallelization of programs and efficient mapping of the parallelized program onto the different cores. The latter determines the load balance between the cores, which in turn has a direct impact on performance. Because parallel loops, such as a parallel DO loop in Fortran, account for a large percentage of total execution time, we focus on the problem of efficiently partitioning the iteration space of (possibly) nested perfect/non-perfect parallel loops. A key aspect of this problem is efficiently capturing cache behavior, since the cache subsystem is often the main performance bottleneck in multi-core systems. In this paper, we present a novel profile-guided compiler technique for cache-aware scheduling of the iteration spaces of such loops. Specifically, we propose an iteration space scheduling technique that captures the variation in the number of cache misses across the iteration space. We then generalize the approach to capture the variation of both cache misses and computation across the iteration space. We demonstrate the efficacy of our approach on a dedicated 4-way Intel® Xeon® based multiprocessor using several kernels from the industry-standard SPEC CPU2000 and CPU2006 benchmarks, achieving speedups of up to 62.5%.
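To illustrate the core idea of profile-guided partitioning, the following is a minimal sketch (not the paper's algorithm) of how a scheduler might cut a one-dimensional iteration space into contiguous chunks of near-equal *cost* rather than near-equal *length*, given hypothetical per-iteration weights obtained from profiling (e.g., cycles including cache-miss penalties). All names here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: profile-guided partitioning of a 1-D iteration space.
# Given a per-iteration weight (e.g., profiled cycles including cache-miss
# penalties), split the space into num_cores contiguous chunks whose total
# weights are as even as possible, instead of num_cores equal-length chunks.

from itertools import accumulate

def partition_iterations(weights, num_cores):
    """Return half-open (start, end) iteration ranges, one per core."""
    prefix = [0] + list(accumulate(weights))  # prefix[i] = cost of iterations [0, i)
    total = prefix[-1]
    bounds, start = [], 0
    for p in range(1, num_cores):
        target = total * p / num_cores        # ideal cumulative cost at this cut
        cut = start
        # advance to the first index whose cumulative cost reaches the target
        while cut < len(weights) and prefix[cut] < target:
            cut += 1
        bounds.append((start, cut))
        start = cut
    bounds.append((start, len(weights)))
    return bounds

# Later iterations are "heavier" here (e.g., they incur more cache misses),
# so the first chunk gets more iterations than the second.
weights = [1, 1, 1, 1, 4, 4, 4, 4]
print(partition_iterations(weights, 2))  # -> [(0, 6), (6, 8)]
```

With uniform weights this degenerates to the usual equal-length static schedule; the benefit appears only when profiling reveals non-uniform cost across the iteration space, which is the situation the paper targets.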
- K. Olukotun and L. Hammond. The future of microprocessors. ACM Queue, 3(7):26--29, 2005.
- Teraflops Research Chip. http://www.intel.com/research/platform/terascale/teraflops.htm.
- H. Sutter and J. Larus. Software and the concurrency revolution. ACM Queue, 3(7), 2005.
- S. F. Lundstrom and G. H. Barnes. A controllable MIMD architecture. In Proceedings of the 1980 International Conference on Parallel Processing, pages 19--27, St. Charles, IL, August 1980.
- SPEC CFP2000. http://www.spec.org/cpu2000/CFP2000.
- M. R. Haghighat and C. D. Polychronopoulos. Symbolic analysis for parallelizing compilers. ACM Transactions on Programming Languages and Systems, 18(4):477--518, July 1996.
- R. Sakellariou. On the Quest for Perfect Load Balance in Loop-Based Parallel Computations. PhD thesis, Department of Computer Science, University of Manchester, October 1996.
- C. Polychronopoulos, D. J. Kuck, and D. A. Padua. Execution of parallel loops on parallel processor systems. In Proceedings of the 1986 International Conference on Parallel Processing, pages 519--527, August 1986.
- E. H. D'Hollander. Partitioning and labeling of loops by unimodular transformations. IEEE Transactions on Parallel and Distributed Systems, 3(4):465--476, 1992.
- A. Kejariwal, H. Saito, X. Tian, M. Girkar, U. Banerjee, A. Nicolau, and C. D. Polychronopoulos. A general approach for partitioning n-dimensional parallel nested loops with conditionals. In Proceedings of the 18th Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 49--58, Cambridge, MA, 2006.
- A. Kejariwal, A. Nicolau, U. Banerjee, and C. D. Polychronopoulos. A novel approach for partitioning iteration spaces with variable densities. In Proceedings of the 10th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 120--131, Chicago, IL, 2005.
- A. Kejariwal, P. D'Alberto, A. Nicolau, and C. D. Polychronopoulos. A geometric approach for partitioning N-dimensional non-rectangular iteration spaces. In Proceedings of the 17th International Workshop on Languages and Compilers for Parallel Computing, pages 102--116, West Lafayette, IN, 2004.
- M. J. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, Redwood City, CA, 1996.
- Z. Guz, I. Keidar, A. Kolodny, and U. Weiser. Nahalal: Cache organization for chip multiprocessors. IEEE Computer Architecture Letters, 6(1), 2007.
- S. Ghosh, M. Martonosi, and S. Malik. Cache miss equations: An analytical representation of cache misses. In Proceedings of the 11th ACM International Conference on Supercomputing, pages 317--324, Vienna, Austria, July 1997.
- J. S. Harper, D. J. Kerbyson, and G. R. Nudd. Analytical modeling of set-associative cache behavior. IEEE Transactions on Computers, 48(10):1009--1024, 1999.
- S. Chatterjee, E. Parker, P. J. Hanlon, and A. R. Lebeck. Exact analysis of the cache behavior of nested loops. In Proceedings of the SIGPLAN '01 Conference on Programming Language Design and Implementation, pages 286--297, Snowbird, UT, 2001.
- B. B. Fraguela, R. Doallo, J. Touriño, and E. L. Zapata. A compiler tool to predict memory hierarchy performance of scientific codes. Parallel Computing, 30(2):225--248, 2004.
- C. Polychronopoulos. Loop coalescing: A compiler transformation for parallel machines. In Proceedings of the 1987 International Conference on Parallel Processing, pages 235--242, August 1987.
- SPEC CINT2006. http://www.spec.org/cpu2006/CINT2006.
- SPEC CFP2006. http://www.spec.org/cpu2006/CFP2006.
- Intel® VTune™ Performance Analyzer 8.0.1 for Windows. http://www.intel.com/cd/software/products/asmo-na/eng/vtune/219898.htm.
- J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, San Mateo, CA, 1990.
- OpenMP Specification, version 2.5. http://www.openmp.org/drupal/mp-documents/spec25.pdf.
- Z. Li. Array privatization for parallel execution of loops. In Proceedings of the 1992 ACM International Conference on Supercomputing, pages 313--322, Washington, D.C., 1992.
- E. W. Weisstein. Abel's impossibility theorem. From MathWorld--A Wolfram Web Resource. http://mathworld.wolfram.com/AbelsImpossibilityTheorem.html.
- D. A. Padua and M. J. Wolfe. Advanced compiler optimizations for supercomputers. Communications of the ACM, 29(12):1184--1201, December 1986.
- M. J. Wolfe. Iteration space tiling for memory hierarchies, December 1987.
- M. E. Wolf and M. Lam. A data locality optimizing algorithm. In Proceedings of the SIGPLAN '91 Conference on Programming Language Design and Implementation, Toronto, Canada, June 1991.
- M. E. Wolf and M. Lam. A loop transformation theory and an algorithm to maximize parallelism. IEEE Transactions on Parallel and Distributed Systems, 2(4):452--471, October 1991.
- M. E. Wolf, D. E. Maydan, and D.-K. Chen. Combining loop transformations considering caches and scheduling. In Proceedings of the 29th International Symposium on Microarchitecture (MICRO-29), pages 274--286, Paris, France, 1996.
- T. Ball and J. Larus. Branch prediction for free. In Proceedings of the SIGPLAN '93 Conference on Programming Language Design and Implementation, pages 300--313, Albuquerque, NM, June 1993.
- A. Krall. Improving semi-static branch prediction by code replication. In Proceedings of the SIGPLAN '94 Conference on Programming Language Design and Implementation, pages 97--106, Orlando, FL, 1994.
- Intel® Compilers for Linux. http://www.intel.com/cd/software/products/asmo-na/eng/compilers/284264.htm.
- D. Kuck, A. H. Sameh, R. Cytron, A. Veidenbaum, C. D. Polychronopoulos, G. Lee, T. McDaniel, B. R. Leasure, C. Beckman, J. R. B. Davies, and C. P. Kruskal. The effects of program restructuring, algorithm change and architecture choice on program performance. In Proceedings of the 1984 International Conference on Parallel Processing, pages 129--138, August 1984.
- M. J. Wolfe. Optimizing Supercompilers for Supercomputers. The MIT Press, Cambridge, MA, 1989.
- D. Kulkarni, K. Kumar, A. Basu, and A. Paulraj. Loop partitioning for distributed memory multiprocessors as unimodular transformations. In Proceedings of the 1991 ACM International Conference on Supercomputing, Cologne, Germany, June 1991.
- M. O'Boyle and G. A. Hedayat. Program and data transformations for efficient execution on distributed memory architectures. Technical Report UMCS-93-1-6, Department of Computer Science, University of Manchester, 1992.
- J. Sheu and T. Thai. Partitioning and mapping nested for-loops on multiprocessor systems. In Proceedings of the 1991 ACM International Conference on Supercomputing, Cologne, Germany, June 1991.
- J.-P. Sheu and T.-S. Chen. Partitioning and mapping of nested loops for linear array multicomputers. Journal of Supercomputing, 9(1--2):183--202, 1995.
- I. Drositis, G. Goumas, N. Koziris, P. Tsanakas, and G. Papakonstantinou. Evaluation of loop grouping methods based on orthogonal projection spaces. In Proceedings of the 2000 International Conference on Parallel Processing, pages 469--476, August 2000.
- A. Asthana, H. V. Jagadish, J. A. Chandross, D. Lin, and S. C. Knauer. An intelligent memory system. SIGARCH Computer Architecture News, 16(4):12--20, 1988.
- J. P. Moskowitz and C. Jousselin. An algebraic memory model. SIGARCH Computer Architecture News, 17(1):55--62, 1989.
- E. Pegg, T. Rowland, and E. W. Weisstein. Cayley graph. From MathWorld--A Wolfram Web Resource. http://mathworld.wolfram.com/CayleyGraph.html.
- J. Ferrante, V. Sarkar, and W. Thrash. On estimating and enhancing cache effectiveness. In U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, editors, Languages and Compilers for Parallel Computing, Fourth International Workshop, Santa Clara, CA, August 1991. Springer-Verlag.
- O. Temam, C. Fricker, and W. Jalby. Cache interference phenomena. In Proceedings of the 1994 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 261--271, Nashville, TN, 1994.
- N. Bermudo, X. Vera, A. González, and J. Llosa. Optimizing cache miss equations polyhedra. SIGARCH Computer Architecture News, 28(1):43--52, 2000.
- G. Ammons, T. Ball, and J. R. Larus. Exploiting hardware performance counters with flow and context sensitive profiling. In Proceedings of the SIGPLAN '97 Conference on Programming Language Design and Implementation, pages 85--96, 1997.
- F. Schneider and T. Gross. Using platform-specific performance counters for dynamic compilation. In Proceedings of the 18th International Workshop on Languages and Compilers for Parallel Computing, Hawthorne, NY, October 2005.
- F. Schneider, M. Payer, and T. Gross. Online optimizations driven by hardware performance monitoring. In Proceedings of the SIGPLAN '07 Conference on Programming Language Design and Implementation, 2007.
- V. Sarkar and B. Simons. Parallel program graphs and their classification. In Proceedings of the Sixth Workshop on Languages and Compilers for Parallel Computing, Portland, OR, August 1993.