ABSTRACT
The slowing pace of commodity microprocessor performance improvements combined with ever-increasing chip power demands has become of utmost concern to computational scientists. As a result, the high-performance computing community is examining alternative architectures that address the limitations of modern cache-based designs. In this work, we examine the potential of using the forthcoming STI Cell processor as a building block for future high-end computing systems. Our work contains several novel contributions. First, we introduce a performance model for Cell and apply it to several key scientific computing kernels: dense matrix multiply, sparse matrix-vector multiply, stencil computations, and 1D/2D FFTs. The difficulty of programming Cell, which requires assembly-level intrinsics for the best performance, makes this model useful as an initial step in algorithm design and evaluation. Next, we validate the accuracy of our model by comparing results against published hardware results, as well as our own implementations on the Cell full system simulator. Additionally, we compare Cell performance to benchmarks run on leading superscalar (AMD Opteron), VLIW (Intel Itanium2), and vector (Cray X1E) architectures. Our work also explores several different mappings of the kernels and demonstrates a simple and effective programming model for Cell's unique architecture. Finally, we propose modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations. Overall, our results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency.
Index Terms
- The potential of the cell processor for scientific computing