Cache efficient bidiagonalization using BLAS 2.5 operators

Authors:
Gary W. Howell, North Carolina State University, Raleigh, NC
James W. Demmel, University of California, Berkeley, CA
Charles T. Fulton, Florida Institute of Technology, Melbourne, FL
Sven Hammarling, University of Manchester, UK
Karen Marmol, Harris Corporation, Melbourne, FL

ACM Transactions on Mathematical Software, Volume 34, Issue 3, Article No. 14, pp. 1–33. https://doi.org/10.1145/1356052.1356055

Published: 16 May 2008

Abstract

On cache-based computer architectures, Householder bidiagonalization with current standard algorithms accounts for a significant portion of the execution time when computing matrix singular values and vectors. In this paper we reorganize the sequence of operations for Householder bidiagonalization of a general m × n matrix so that two (_GEMV) matrix-vector multiplications can be done with one pass of the unreduced trailing part of the matrix through cache. Two new BLAS operations approximately halve the transfer of data from main memory to cache, reducing execution times by up to 25 percent. We give detailed algorithm descriptions and compare timings with the current LAPACK bidiagonalization algorithm.
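The idea of performing two matrix-vector products in a single pass over the matrix can be illustrated with a minimal sketch. It assumes a fused operation of the form x = βAᵀy + z followed by w = αAx, which is loosely modeled on this kind of BLAS 2.5 operator and is not taken from the paper's algorithm descriptions. The function name `fused_gemvt` and the `block` parameter are hypothetical. Because x_j depends only on column j of A, each column block can be used for both products while it is still resident in cache.

```python
import numpy as np

def fused_gemvt(A, y, z, alpha=1.0, beta=1.0, block=64):
    """Sketch: x = beta * A.T @ y + z and w = alpha * A @ x, one pass over A.

    Hypothetical helper for illustration only; a tuned kernel would be
    written in a compiled language with explicit cache blocking.
    """
    m, n = A.shape
    x = np.empty(n)
    w = np.zeros(m)
    for j0 in range(0, n, block):
        j1 = min(j0 + block, n)
        Ablk = A[:, j0:j1]                   # column block, loaded once
        # First product: entries of x that depend only on this block.
        x[j0:j1] = beta * (Ablk.T @ y) + z[j0:j1]
        # Second product: reuse the same block while it is still hot.
        w += Ablk @ x[j0:j1]
    w *= alpha
    return x, w

# Usage: compare against two separate matrix-vector products.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 150))
y = rng.standard_normal(200)
z = rng.standard_normal(150)
x, w = fused_gemvt(A, y, z, alpha=2.0, beta=0.5, block=32)
```

Two separate _GEMV calls would stream the whole of A from main memory twice when A does not fit in cache. The blocked loop touches each column block once, which is the source of the roughly halved memory traffic the abstract describes. In NumPy the sketch shows only the order of data access, not the achievable speedup.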

Index Terms

Cache efficient bidiagonalization using BLAS 2.5 operators
1. Computing methodologies
  1. Symbolic and algebraic manipulation
    1. Symbolic and algebraic algorithms
      1. Linear algebra algorithms
2. Mathematics of computing
  1. Mathematical analysis
    1. Numerical analysis
      1. Computations on matrices
  2. Mathematical software

Published in

ACM Transactions on Mathematical Software Volume 34, Issue 3
May 2008
130 pages
ISSN: 0098-3500
EISSN: 1557-7295
DOI: 10.1145/1356052

Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 16 May 2008
- Accepted: 1 April 2007
- Revised: 1 January 2007
- Received: 1 April 2006
Author Tags
BLAS 2.5
Householder reflections
SVD
bidiagonalization
cache-efficient
matrix factorization
singular values