skip to main content
research-article

Cache efficient bidiagonalization using BLAS 2.5 operators

Published:16 May 2008Publication History
Skip Abstract Section

Abstract

On cache based computer architectures using current standard algorithms, Householder bidiagonalization requires a significant portion of the execution time for computing matrix singular values and vectors. In this paper we reorganize the sequence of operations for Householder bidiagonalization of a general m × n matrix, so that two (_GEMV) vector-matrix multiplications can be done with one pass of the unreduced trailing part of the matrix through cache. Two new BLAS operations approximately cut in half the transfer of data from main memory to cache, reducing execution times by up to 25 per cent. We give detailed algorithm descriptions and compare timings with the current LAPACK bidiagonalization algorithm.

References

  1. Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Greenbaum, A., Hammarling, S., Mckenney, A., and Sorensen, D. 1999. LAPACK User's Guide, 3rd. Ed. SIAM, Philadelphia, PA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Barlow, J. L., Bosner, N., and Drmač, Z. 2000. A new stable bidiagonal reduction algorithm. Lin. Alg. Appl. 397, 35--84.Google ScholarGoogle ScholarCross RefCross Ref
  3. Berry, M. 1992. Large scale singular value computations. Internat. J. Supercomput. Appl. 6, 13--49.Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Berry, M., Do, T., O'brien, G., Krishna, V., and Varadhan, S. 1993. SVDPACKC: version 1.0 user's guide, Tech. rep. CS-93-194, University of Tennessee, Knoxville, TN. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Berry, M., Dumais, S., and O'brien, G. 1995. Using linear algebra for intelligent information retrieval. SIAM Rev. 37, 4 573--595. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Bischof, C. H. and Van Loan, C. F. 1987. The WY representaion of products of Householder matrices, SIAM J. Sci. Stat. Comput. 8, s2--s13. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Blackford, S. and Dongarra, J. 1999. Installation guide for LAPACK, LAPACK Working Note 41.Google ScholarGoogle Scholar
  8. Blackford, L. S., Corliss, G., Demmel, J., Dongarra, J., Duff, I., Hammarling, S., Henry, G., Heroux, M., Hu, C., Kahan, W., Kaufmann, L., Kearfott, B., Frogh, F., Li, X., Maany, Z., Petitet, A., Pozo, R., Remington, K., Walster, W., Whaley, C., Wolff, V., Gudenberg, J., and Lumsdaine, A. 2002. Basic linear algebra subprograms technical (BLAST) forum standard, Int. J. High Perform. Comput. 12, 1--2 (www.netlib.org/blas/blast-forum).Google ScholarGoogle Scholar
  9. Blackford, S., Demmel, J., Dongarra, J., Duff, I., Hammarling, S., Henry, G., Heroux, M., Kaufmann, L., Lumsdaine, A., Petitet, A., Pozo, R., Remington, K., and Whaley, C. 2002. An updated set of basic linear algebra subprograms (BLAS). ACM Trans. Math. Softw. 28, 2, 135--151. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Bosner, N. and Barlow, J. L. 2005. Block and parallel versions of one-sided bidiagonalization. preprint http://www.cse.psu.edu/barlow/block_bidiag.pdf.Google ScholarGoogle Scholar
  11. Choi, J., Dongarra, J., and Walker, D. 1995. The design of a parallel dense linear algebra software library: reduction to Hessenberg, tridiagonal, and bidiagonal form. (LAPACK Working Note # 92) Num. Alg., 10, 379--399.Google ScholarGoogle Scholar
  12. Dhillon, I. S. 1997. A New O(n2) Algorithm for the symmetric tridiagonal eigenvalue/eigenvector problem. PhD thesis, University of California, Berkeley, CA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Dongarra, J., Hammarling, S., and Sorensen, D. 1989. Block reduction of matrices to condensed forms for eigenvalue computations. J. Comput. Appl. Math. 27, 215--227.Google ScholarGoogle ScholarCross RefCross Ref
  14. Dongarra, J., Duff, I., Sorensen, D., and Van Der Vorst, H. 1988. Numerical Linear Algebra for High-Performance Computers. SIAM, Philadelphia, PA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Douglas, C. C., Haase, G., Hu, J., Kowarschik, M., Rüde, U., and Weiss, C. 2000. Portable memory heirarchy techniques for PDE solvers: Part I. SIAM News, 33, 5.Google ScholarGoogle Scholar
  16. Fernando, V., Parlett, B., and Dhillon, I. 1995. A way to find the most redundant equation in a tridiagonal system. Mathematics Department, University of California, Berkeley.Google ScholarGoogle Scholar
  17. Goedecker, S. and Hoise, A. 2001. Performance Optimization for Numerically Intensive Codes, SIAM, Philadelphia, PA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Golub, G. and Kahan, W. 1965. Calculating the singular Values and pseudo-inverse of a matrix, SIAM J. Num. Anal., 2, 205--224.Google ScholarGoogle Scholar
  19. Golub, G. and Reinsch, C. 1970. Singular value decomposition and least squares solution. Numer. Math. 14, 403--420.Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Golub, G. and Van Loan, C. F. 1996. Matrix Computations 3rd Ed. Johns Hopkins University Press, Baltimore, MA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Grösser, B. and Lang, B. 1988. Efficient parallel reduction to bidiagonal form, Preprint BUGHW-SC 98/2. http://www.math.uni-wuppertal/org/SciComp/Preprint/SC9802ips.gz.Google ScholarGoogle Scholar
  22. Howell, G. W. 2001. Sparse Householder bidiagonalization. CERFACS Sparse Days. http://ncsu. edu/itd/hpc/Documents/Publications/gary_howell/cerfacs01.psGoogle ScholarGoogle Scholar
  23. Lang, B. 1996. Parallel reduction of banded matrices to bidiagonal form. Parall. Comput., 22, 1--18. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Owens, B. 2003. A Matlab script for 2-blocking to speed Ralha-Barlow one-sided bidiagonalization. Summer intern project at ERDC, MSRC, Vicksburg, MS. http://ncsu.edu/itd/hpc/Documents/Publications/gary_howell/barlow3.m.Google ScholarGoogle Scholar
  25. Paige, C. and Saunders, M. 1982. An algorithm for sparse linear equations and sparse least squares. ACM Trans. Math. Softw. 8, 1, 43--71. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Parlett, B. and Dhillon, I. 1997. Fernando's solution to Wilkinson's problem: An application of double factorization. Lin. Alg. Appl. 267, 247--279.Google ScholarGoogle ScholarCross RefCross Ref
  27. Ralha, R. M. S. 2003. One-sided reduction to bidiagonal form. Lin. Alg. Appl. 358, 219--238.Google ScholarGoogle ScholarCross RefCross Ref
  28. Schreiber, R. and Van Loan, C. F. 1989. A storage-efficient WY representation for products of householder transformations. SIAM Sci. Stat. Comp. 10, 53--57. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Stanley, K. 1997. Execution time of symmetric eigensolvers. Ph.D. dissertation, University of California, Berkeley, CA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Whaley, C. and Dongarra, J. 1999. Automatically tuned linear algebra in software. In Proceedings of the 9th SIAM Conference on Parallel Processing for Scientific Computing. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Cache efficient bidiagonalization using BLAS 2.5 operators

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        Full Access

        • Published in

          cover image ACM Transactions on Mathematical Software
          ACM Transactions on Mathematical Software  Volume 34, Issue 3
          May 2008
          130 pages
          ISSN:0098-3500
          EISSN:1557-7295
          DOI:10.1145/1356052
          Issue’s Table of Contents

          Copyright © 2008 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 16 May 2008
          • Accepted: 1 April 2007
          • Revised: 1 January 2007
          • Received: 1 April 2006
          Published in toms Volume 34, Issue 3

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • research-article
          • Research
          • Pre-selected

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader