Abstract
On cache-based computer architectures using current standard algorithms, Householder bidiagonalization accounts for a significant portion of the execution time when computing the singular values and vectors of a matrix. In this paper we reorganize the sequence of operations for Householder bidiagonalization of a general m × n matrix so that two vector-matrix multiplications (_GEMV) can be performed with a single pass of the unreduced trailing part of the matrix through cache. Two new BLAS operations roughly halve the data transferred from main memory to cache, reducing execution times by up to 25 percent. We give detailed algorithm descriptions and compare timings with the current LAPACK bidiagonalization algorithm.
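The core idea, doing two matrix-vector products with one sweep over the matrix, can be illustrated with a small fused kernel. The sketch below is modeled on the fused pattern that the BLAST Forum standard describes as _GEMVT (x = β·Aᵀy + z, then w = α·A x), as we understand it. The Python name `fused_gemvt` and its signature are hypothetical, and the kernel is not the paper's reorganized bidiagonalization, which is what makes both products available at the same point in the algorithm. Because x[j] depends only on column j of A, it is complete while that column is still in cache and can be used at once for the A x accumulation, so A is read once instead of twice.

```python
import numpy as np

def fused_gemvt(A, y, z, alpha=1.0, beta=1.0):
    """Hypothetical fused kernel: x = beta*A^T y + z, then w = alpha*A x.

    Each column A[:, j] is read exactly once. x[j] depends only on column j,
    so it is finished while that column is still resident and is reused
    immediately for the A x accumulation (one pass over A instead of two).
    Assumes A is stored column-major (Fortran order), as in LAPACK.
    """
    m, n = A.shape
    x = np.empty(n)
    w = np.zeros(m)
    for j in range(n):
        col = A[:, j]                   # load column j once
        x[j] = beta * (col @ y) + z[j]  # j-th entry of beta*A^T y + z
        w += (alpha * x[j]) * col       # axpy: column j's share of alpha*A x
    return x, w

# Usage: the fused result should match two separate GEMV-style products.
rng = np.random.default_rng(0)
A = np.asfortranarray(rng.standard_normal((6, 4)))
y = rng.standard_normal(6)
z = rng.standard_normal(4)
x, w = fused_gemvt(A, y, z, alpha=2.0, beta=0.5)
```

In Python the explicit loop is only for clarity; the reduction in memory traffic shows up in a compiled implementation where each column is streamed from main memory once and both the dot product and the axpy are served from cache.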
Index Terms
- Cache efficient bidiagonalization using BLAS 2.5 operators