DOI: 10.1145/1128022.1128027

The potential of the Cell processor for scientific computing

Published: 03 May 2006

ABSTRACT

The slowing pace of commodity microprocessor performance improvements, combined with ever-increasing chip power demands, has become of utmost concern to computational scientists. As a result, the high-performance computing community is examining alternative architectures that address the limitations of modern cache-based designs. In this work, we examine the potential of using the forthcoming STI Cell processor as a building block for future high-end computing systems. Our work contains several novel contributions. First, we introduce a performance model for Cell and apply it to several key scientific computing kernels: dense matrix multiply, sparse matrix-vector multiply, stencil computations, and 1D/2D FFTs. The difficulty of programming Cell, which requires assembly-level intrinsics for the best performance, makes this model useful as an initial step in algorithm design and evaluation. Next, we validate the accuracy of our model by comparing its predictions against published hardware results, as well as against our own implementations on the Cell full-system simulator. Additionally, we compare Cell performance to benchmarks run on leading superscalar (AMD Opteron), VLIW (Intel Itanium2), and vector (Cray X1E) architectures. Our work also explores several different mappings of the kernels and demonstrates a simple and effective programming model for Cell's unique architecture. Finally, we propose modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations. Overall, the results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency.
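The abstract does not spell out the performance model itself. As a rough illustration of what an analytic kernel model can look like, the sketch below bounds a kernel's run time by the slower of its arithmetic time and its memory-transfer time, assuming the two overlap perfectly. This is a generic bound-based estimate, not the authors' model. The names `Machine`, `predicted_time`, and `spmv_cost` are hypothetical, and the machine parameters are placeholder values, not Cell specifications.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    """Illustrative machine description (placeholder values, not Cell's)."""
    peak_gflops: float    # peak arithmetic rate in GFLOP/s
    bandwidth_gbs: float  # sustained memory bandwidth in GB/s


def predicted_time(flops: float, bytes_moved: float, m: Machine) -> float:
    """Estimate run time in seconds as max(compute time, memory time).

    Assumes computation and data movement overlap perfectly, so the
    slower of the two determines the total.
    """
    compute_s = flops / (m.peak_gflops * 1e9)
    memory_s = bytes_moved / (m.bandwidth_gbs * 1e9)
    return max(compute_s, memory_s)


def spmv_cost(nnz: int, nrows: int, value_bytes: int = 8, index_bytes: int = 4):
    """Rough operation and traffic counts for sparse matrix-vector multiply.

    Assumption: each nonzero costs one multiply and one add, each nonzero
    value and column index is read once, and the source and destination
    vectors (taken to have nrows entries each) are each moved once.
    """
    flops = 2 * nnz
    bytes_moved = nnz * (value_bytes + index_bytes) + 2 * nrows * value_bytes
    return flops, bytes_moved


if __name__ == "__main__":
    machine = Machine(peak_gflops=100.0, bandwidth_gbs=25.0)  # hypothetical
    flops, nbytes = spmv_cost(nnz=1_000_000, nrows=100_000)
    t = predicted_time(flops, nbytes, machine)
    print(f"predicted time: {t:.3e} s, rate: {flops / t / 1e9:.2f} GFLOP/s")
```

With these placeholder numbers the memory term dominates, which is the usual situation for sparse matrix-vector multiply; a dense matrix multiply with good blocking would instead be limited by the compute term.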


Published in

CF '06: Proceedings of the 3rd conference on Computing frontiers
May 2006
430 pages
ISBN: 1595933026
DOI: 10.1145/1128022

Copyright © 2006 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

Overall acceptance rate: 240 of 680 submissions (35%)
