ABSTRACT
Sparse matrix-vector multiplication (SpMV) is of singular importance in sparse linear algebra. In contrast to the uniform regularity of dense linear algebra, sparse operations encounter a broad spectrum of matrices ranging from the regular to the highly irregular. Harnessing the tremendous potential of throughput-oriented processors for sparse operations requires that we expose substantial fine-grained parallelism and impose sufficient regularity on execution paths and memory access patterns. We explore SpMV methods that are well-suited to throughput-oriented architectures like the GPU and which exploit several common sparsity classes. The techniques we propose are efficient, successfully utilizing large percentages of peak bandwidth. Furthermore, they deliver excellent total throughput, averaging 16 GFLOP/s and 10 GFLOP/s in double precision for structured grid and unstructured mesh matrices, respectively, on a GeForce GTX 285. This is roughly 2.8 times the throughput previously achieved on Cell BE and more than 10 times that of a quad-core Intel Clovertown system.
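To make the kernel under discussion concrete, the following is a minimal sketch of a scalar CSR (compressed sparse row) SpMV kernel in CUDA, assigning one thread per matrix row. The kernel name and array names (csr_spmv_scalar, Ap, Aj, Ax) are illustrative assumptions for this sketch, not code taken from the paper; it shows the fine-grained parallelism the abstract refers to, though this simple one-thread-per-row mapping does not by itself yield the regular, coalesced memory access patterns the abstract identifies as essential on throughput-oriented processors.

```cuda
// Minimal sketch: scalar CSR SpMV, one thread per row.
// Ap: row pointers (length num_rows + 1), Aj: column indices,
// Ax: nonzero values, x: input vector, y: output vector.
// Names and structure are assumptions for illustration only.
__global__ void csr_spmv_scalar(int num_rows,
                                const int    *Ap,
                                const int    *Aj,
                                const double *Ax,
                                const double *x,
                                double       *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        double sum = 0.0;
        // Accumulate the dot product of row 'row' with x.
        for (int jj = Ap[row]; jj < Ap[row + 1]; ++jj)
            sum += Ax[jj] * x[Aj[jj]];
        y[row] = sum;
    }
}
```

A typical launch would cover all rows with fixed-size thread blocks, e.g. csr_spmv_scalar<<<(num_rows + 255) / 256, 256>>>(num_rows, Ap, Aj, Ax, x, y); the block size of 256 is an arbitrary choice for the sketch, not a tuned value.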