ABSTRACT
Sparse matrix-vector multiplication (SpMV) is of singular importance in sparse linear algebra. In contrast to the uniform regularity of dense linear algebra, sparse operations encounter a broad spectrum of matrices ranging from the regular to the highly irregular. Harnessing the tremendous potential of throughput-oriented processors for sparse operations requires that we expose substantial fine-grained parallelism and impose sufficient regularity on execution paths and memory access patterns. We explore SpMV methods that are well-suited to throughput-oriented architectures like the GPU and which exploit several common sparsity classes. The techniques we propose are efficient, successfully utilizing large percentages of peak bandwidth. Furthermore, they deliver excellent total throughput, averaging 16 GFLOP/s and 10 GFLOP/s in double precision for structured grid and unstructured mesh matrices, respectively, on a GeForce GTX 285. This is roughly 2.8 times the throughput previously achieved on Cell BE and more than 10 times that of a quad-core Intel Clovertown system.
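To make the kernel under discussion concrete, the following is a minimal sketch of a scalar CSR (compressed sparse row) SpMV kernel in CUDA, assigning one thread per matrix row. The kernel name and array names (csr_spmv_scalar, Ap, Aj, Ax) are illustrative assumptions for this sketch, not code taken from the paper; it shows the fine-grained parallelism the abstract refers to, though this simple one-thread-per-row mapping does not by itself yield the regular, coalesced memory access patterns the abstract identifies as essential on throughput-oriented processors.

```cuda
// Minimal sketch: scalar CSR SpMV, one thread per row.
// Ap: row pointers (length num_rows + 1), Aj: column indices,
// Ax: nonzero values, x: input vector, y: output vector.
// Names and structure are assumptions for illustration only.
__global__ void csr_spmv_scalar(int num_rows,
                                const int    *Ap,
                                const int    *Aj,
                                const double *Ax,
                                const double *x,
                                double       *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        double sum = 0.0;
        // Accumulate the dot product of row 'row' with x.
        for (int jj = Ap[row]; jj < Ap[row + 1]; ++jj)
            sum += Ax[jj] * x[Aj[jj]];
        y[row] = sum;
    }
}
```

A typical launch would cover all rows with fixed-size thread blocks, e.g. csr_spmv_scalar<<<(num_rows + 255) / 256, 256>>>(num_rows, Ap, Aj, Ax, x, y); the block size of 256 is an arbitrary choice for the sketch, not a tuned value.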