ABSTRACT
Fueled by the surge of successful deep-neural-network applications and the great computational power they demand, modern processors and accelerators are beginning to offer half-precision floating-point arithmetic and special matrix units (neural engines), such as the NVIDIA Tensor Core on GPUs and the Google Tensor Processing Unit (TPU), to accelerate the training and inference of deep neural networks. It remains unclear how neural engines can be profitably used in applications other than neural networks. In this paper we present an endeavor to accelerate and stabilize a fundamental matrix factorization on neural engines, the QR factorization, which may open the door to much wider relevance in science, engineering, and data science. We show that traditional Householder QR algorithms and implementations lack the data locality, parallelism, accuracy, and robustness needed on neural engines, which are characterized by extreme speed but low precision and narrow range.
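To illustrate the data-access pattern at issue, here is a minimal unblocked Householder QR in NumPy. This is an illustrative sketch, not the paper's implementation: each step applies a rank-1 (Level-2 BLAS) update to the trailing submatrix with little data reuse, which is why this classical formulation maps poorly onto matrix-multiply hardware such as tensor cores.

```python
import numpy as np

def householder_qr(A):
    """Unblocked Householder QR: A = Q @ R with Q orthogonal, R upper triangular.

    Each iteration forms a Householder reflector from one column and applies
    it as a rank-1 update -- a memory-bound pattern with no reuse across
    columns, in contrast to the GEMM-rich algorithms neural engines favor.
    """
    m, n = A.shape
    R = A.astype(np.float64).copy()
    Q = np.eye(m)
    for k in range(n):
        x = R[k:, k]
        # Householder vector v: (I - 2 v v^T / v^T v) maps x to -sign(x0)*||x|| e1
        v = x.copy()
        v[0] += np.copysign(np.linalg.norm(x), x[0])
        beta = v @ v
        if beta == 0.0:
            continue  # column already zero below the diagonal
        # Rank-1 update of the trailing submatrix (Level-2 BLAS, memory-bound)
        R[k:, k:] -= np.outer(2.0 * v / beta, v @ R[k:, k:])
        # Accumulate Q by applying the same reflector from the right
        Q[:, k:] -= np.outer(Q[:, k:] @ v, 2.0 * v / beta)
    return Q, R
```

Blocked (WY) variants recover some GEMM work, but the panel factorization itself remains sequential in this style.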
We demonstrate that neural engines can effectively accelerate matrix computations (a 3.0x-14.6x QR speedup over cuSOLVER, reaching up to 36.6 TFLOPS), but different algorithms, such as recursive Gram-Schmidt, are needed to expose more locality and parallelism, even at the cost of additional computation. Moreover, scaling, iterative refinement, and other safeguarding procedures are needed to regain accuracy and avoid overflow. Our experience suggests that, with present neural engines, matrix factorizations (QR, LU, Cholesky) are best co-designed with their applications (linear solvers, least squares, orthogonalization, SVD, etc.) to achieve high performance with adequate accuracy and reliability.
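As a sketch of the algorithm family the abstract refers to, the following is a minimal block classical Gram-Schmidt with reorthogonalization (CGS2) in NumPy; the function name, blocking strategy, and use of `np.linalg.qr` for the small panel factorization are illustrative choices, not the paper's recursive implementation. The key property is that the projections are matrix-matrix products, so most of the work can run on matrix-multiply hardware, while the second projection pass ("twice is enough") restores the orthogonality that a single classical Gram-Schmidt pass loses in low precision.

```python
import numpy as np

def block_cgs2(A, block=2):
    """Block classical Gram-Schmidt with reorthogonalization (CGS2 sketch).

    Projections against previously computed panels are GEMMs (Q.T @ P and
    Q @ C), exposing the locality and parallelism that neural engines need;
    the extra second pass trades additional computation for accuracy.
    """
    m, n = A.shape
    Q = np.zeros((m, 0))
    R = np.zeros((n, n))
    for j in range(0, n, block):
        P = A[:, j:j + block].copy()
        for _ in range(2):              # two passes: reorthogonalization
            C = Q.T @ P                 # GEMM: coefficients vs. prior panels
            P -= Q @ C                  # GEMM: remove those components
            R[:Q.shape[1], j:j + block] += C
        # Orthonormalize the small panel itself (done in working precision)
        Qj, Rj = np.linalg.qr(P)
        R[j:j + block, j:j + block] = Rj
        Q = np.hstack([Q, Qj])
    return Q, R
```

In a mixed-precision setting the GEMMs would run in half precision on the neural engine, with the panels scaled beforehand to keep squared norms within fp16 range, as the safeguarding procedures mentioned above suggest.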
High Accuracy Matrix Computations on Neural Engines: A Study of QR Factorization and its Applications