Implicit and explicit optimizations for stencil computations

Authors:
Shoaib Kamil, Lawrence Berkeley National Laboratory, Berkeley, CA
Kaushik Datta, University of California, Berkeley, CA
Samuel Williams, University of California, Berkeley, CA
Leonid Oliker, Lawrence Berkeley National Laboratory, Berkeley, CA
John Shalf, Lawrence Berkeley National Laboratory, Berkeley, CA
Katherine Yelick, Lawrence Berkeley National Laboratory, Berkeley, CA and University of California, Berkeley, CA

MSPC '06: Proceedings of the 2006 workshop on Memory system performance and correctness, October 2006, Pages 51–60
https://doi.org/10.1145/1178597.1178605

Published: 22 October 2006

ABSTRACT

Stencil-based kernels constitute the core of many scientific applications on block-structured grids. Unfortunately, these codes achieve a low fraction of peak performance, due primarily to the disparity between processor and main memory speeds. We examine several optimizations on the conventional cache-based memory systems of the Itanium 2, Opteron, and Power5, as well as on the heterogeneous multicore design of the Cell processor. The optimizations target cache reuse across stencil sweeps and include both an implicit cache-oblivious approach and a cache-aware algorithm blocked to match the cache structure. Finally, we consider stencil computations on a machine with an explicitly managed memory hierarchy, the Cell processor. Overall, results show that a cache-aware approach is significantly faster than a cache-oblivious approach and that the explicitly managed memory on Cell is more efficient: relative to the Power5, it has almost 2x more memory bandwidth and is 3.7x faster.
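
The idea of reusing cached data across stencil sweeps can be illustrated with a minimal Python sketch. It is not the algorithm evaluated in the paper: it uses overlapped temporal tiling, in which each tile recomputes a halo of `steps` points per side, a simpler technique chosen here only for illustration. The 1D three-point averaging stencil, the fixed boundary values, the function names `naive_sweeps` and `blocked_sweeps`, and the `block` parameter are all assumptions.

```python
def naive_sweeps(grid, steps):
    """Apply `steps` full sweeps; every sweep streams the whole grid."""
    cur = list(grid)
    for _ in range(steps):
        nxt = list(cur)  # endpoints are copied, so boundaries stay fixed
        for i in range(1, len(cur) - 1):
            nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0
        cur = nxt
    return cur


def blocked_sweeps(grid, steps, block):
    """Same result, but each tile is advanced `steps` sweeps at once.

    A tile owns interior points [lo, hi). It copies a halo of `steps`
    extra points per side (clipped at the grid edges), so stale values
    at the halo ends never reach the owned points within `steps` sweeps.
    """
    n = len(grid)
    out = list(grid)
    for lo in range(1, n - 1, block):
        hi = min(lo + block, n - 1)
        left = max(lo - steps, 0)
        right = min(hi + steps, n)
        local = list(grid[left:right])  # small working set, reused per sweep
        for _ in range(steps):
            nxt = list(local)
            for j in range(1, len(local) - 1):
                nxt[j] = (local[j - 1] + local[j] + local[j + 1]) / 3.0
            local = nxt
        # Keep only the owned points; the halo results are discarded.
        out[lo:hi] = local[lo - left:hi - left]
    return out


if __name__ == "__main__":
    grid = [float((i * 7) % 11) for i in range(50)]
    ref = naive_sweeps(grid, 4)
    blk = blocked_sweeps(grid, 4, 8)
    print(all(abs(a - b) < 1e-12 for a, b in zip(ref, blk)))
```

Both functions return the same grid. The blocked version trades redundant halo arithmetic for a working set that is revisited `steps` times before moving on; in a compiled implementation, `block` would be chosen so that a tile plus its halo fits in the target cache level.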

Index Terms

Implicit and explicit optimizations for stencil computations
1. Hardware
  1. Hardware validation
  2. Integrated circuits
    1. Semiconductor memory
      1. Dynamic memory

Published in
MSPC '06: Proceedings of the 2006 workshop on Memory system performance and correctness
October 2006
114 pages
ISBN: 1595935789
DOI: 10.1145/1178597
General Chair: Antony Hosking, Purdue U
Program Chair: Ali-Reza Adl-Tabatabai, Intel
Copyright © 2006 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
Overall Acceptance Rate: 6 of 20 submissions, 30%