ABSTRACT
Fueled by the surge of successful deep-neural-network applications and the great computational power they demand, modern processors and accelerators are beginning to offer half-precision floating-point arithmetic and special matrix units (neural engines), such as the NVIDIA Tensor Core on GPUs and the Google Tensor Processing Unit (TPU), to accelerate the training and inference of deep neural networks. It remains unclear how neural engines can be profitably used in applications other than neural networks. In this paper we present an endeavor to accelerate and stabilize a fundamental matrix factorization on neural engines, the QR factorization, which may open the door to much wider relevance in science, engineering, and data science. We show that traditional Householder QR algorithms and implementations lack the data locality, parallelism, accuracy, and robustness needed on neural engines, which are characterized by extreme speed but low precision and narrow range.
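To illustrate the data-access pattern at issue, here is a minimal unblocked Householder QR in NumPy. This is an illustrative sketch, not the paper's implementation: each step applies a rank-1 (Level-2 BLAS) update to the trailing submatrix with little data reuse, which is why this classical formulation maps poorly onto matrix-multiply hardware such as tensor cores.

```python
import numpy as np

def householder_qr(A):
    """Unblocked Householder QR: A = Q @ R with Q orthogonal, R upper triangular.

    Each iteration forms a Householder reflector from one column and applies
    it as a rank-1 update -- a memory-bound pattern with no reuse across
    columns, in contrast to the GEMM-rich algorithms neural engines favor.
    """
    m, n = A.shape
    R = A.astype(np.float64).copy()
    Q = np.eye(m)
    for k in range(n):
        x = R[k:, k]
        # Householder vector v: (I - 2 v v^T / v^T v) maps x to -sign(x0)*||x|| e1
        v = x.copy()
        v[0] += np.copysign(np.linalg.norm(x), x[0])
        beta = v @ v
        if beta == 0.0:
            continue  # column already zero below the diagonal
        # Rank-1 update of the trailing submatrix (Level-2 BLAS, memory-bound)
        R[k:, k:] -= np.outer(2.0 * v / beta, v @ R[k:, k:])
        # Accumulate Q by applying the same reflector from the right
        Q[:, k:] -= np.outer(Q[:, k:] @ v, 2.0 * v / beta)
    return Q, R
```

Blocked (WY) variants recover some GEMM work, but the panel factorization itself remains sequential in this style.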
We demonstrate that neural engines can effectively accelerate matrix computations (a 3.0x-14.6x QR speedup over cuSOLVER, reaching up to 36.6 TFLOPS), but different algorithms, such as recursive Gram-Schmidt, are needed to expose more locality and parallelism, even at the cost of additional computation. Moreover, scaling, iterative refinement, and other safeguarding procedures are needed to regain accuracy and avoid overflow. Our experience suggests that, with present neural engines, matrix factorizations (QR, LU, Cholesky) are best co-designed with their applications (linear solvers, least squares, orthogonalization, SVD, etc.) to achieve high performance with adequate accuracy and reliability.
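As a sketch of the algorithm family the abstract refers to, the following is a minimal block classical Gram-Schmidt with reorthogonalization (CGS2) in NumPy; the function name, blocking strategy, and use of `np.linalg.qr` for the small panel factorization are illustrative choices, not the paper's recursive implementation. The key property is that the projections are matrix-matrix products, so most of the work can run on matrix-multiply hardware, while the second projection pass ("twice is enough") restores the orthogonality that a single classical Gram-Schmidt pass loses in low precision.

```python
import numpy as np

def block_cgs2(A, block=2):
    """Block classical Gram-Schmidt with reorthogonalization (CGS2 sketch).

    Projections against previously computed panels are GEMMs (Q.T @ P and
    Q @ C), exposing the locality and parallelism that neural engines need;
    the extra second pass trades additional computation for accuracy.
    """
    m, n = A.shape
    Q = np.zeros((m, 0))
    R = np.zeros((n, n))
    for j in range(0, n, block):
        P = A[:, j:j + block].copy()
        for _ in range(2):              # two passes: reorthogonalization
            C = Q.T @ P                 # GEMM: coefficients vs. prior panels
            P -= Q @ C                  # GEMM: remove those components
            R[:Q.shape[1], j:j + block] += C
        # Orthonormalize the small panel itself (done in working precision)
        Qj, Rj = np.linalg.qr(P)
        R[j:j + block, j:j + block] = Rj
        Q = np.hstack([Q, Qj])
    return Q, R
```

In a mixed-precision setting the GEMMs would run in half precision on the neural engine, with the panels scaled beforehand to keep squared norms within fp16 range, as the safeguarding procedures mentioned above suggest.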
High Accuracy Matrix Computations on Neural Engines: A Study of QR Factorization and its Applications