Article

The Potential of the Cell Processor for Scientific Computing

Authors:
Samuel Williams, Lawrence Berkeley National Laboratory, Berkeley, CA
John Shalf, Lawrence Berkeley National Laboratory, Berkeley, CA
Leonid Oliker, Lawrence Berkeley National Laboratory, Berkeley, CA
Shoaib Kamil, Lawrence Berkeley National Laboratory, Berkeley, CA
Parry Husbands, Lawrence Berkeley National Laboratory, Berkeley, CA
Katherine Yelick, Lawrence Berkeley National Laboratory, Berkeley, CA

CF '06: Proceedings of the 3rd conference on Computing frontiers, May 2006, Pages 9–20. https://doi.org/10.1145/1128022.1128027

Published: 03 May 2006

ABSTRACT

The slowing pace of commodity microprocessor performance improvements combined with ever-increasing chip power demands has become of utmost concern to computational scientists. As a result, the high performance computing community is examining alternative architectures that address the limitations of modern cache-based designs. In this work, we examine the potential of using the forthcoming STI Cell processor as a building block for future high-end computing systems. Our work contains several novel contributions. First, we introduce a performance model for Cell and apply it to several key scientific computing kernels: dense matrix multiply, sparse matrix vector multiply, stencil computations, and 1D/2D FFTs. The difficulty of programming Cell, which requires assembly level intrinsics for the best performance, makes this model useful as an initial step in algorithm design and evaluation. Next, we validate the accuracy of our model by comparing results against published hardware results, as well as our own implementations on the Cell full system simulator. Additionally, we compare Cell performance to benchmarks run on leading superscalar (AMD Opteron), VLIW (Intel Itanium2), and vector (Cray X1E) architectures. Our work also explores several different mappings of the kernels and demonstrates a simple and effective programming model for Cell's unique architecture. Finally, we propose modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations. Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency.

Published in
CF '06: Proceedings of the 3rd conference on Computing frontiers
May 2006
430 pages
ISBN: 1595933026
DOI: 10.1145/1128022
General Chairs: Monica Alderighi (IASF - INAF), Valentina Salapura (IBM)
Program Chair: Sally A. McKee (Cornell University)
Copyright © 2006 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags: FFT, GEMM, SpMV, cell processor, sparse matrix, stencil, three level memory