Research Article · HPDC '20 Conference Proceedings · DOI: 10.1145/3369583.3392685

High Accuracy Matrix Computations on Neural Engines: A Study of QR Factorization and its Applications

Published: 23 June 2020

ABSTRACT

Fueled by the rapidly expanding range of successful deep neural network applications and the computational power they demand, modern processors and accelerators are beginning to offer half-precision floating-point arithmetic and special units (neural engines), such as the NVIDIA TensorCore on GPUs and the Google Tensor Processing Unit (TPU), to accelerate the training and inference of deep neural networks. It remains unclear how neural engines can be profitably used in applications other than neural networks. In this paper we present an effort to accelerate and stabilize a fundamental matrix factorization, the QR factorization, on neural engines, which may open the door to much wider relevance in scientific computing, engineering, and data science. We show that traditional Householder QR algorithms and implementations lack the data locality, parallelism, accuracy, and robustness needed on neural engines, which are characterized by extreme speed and low precision/range.
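To make the precision/range constraint concrete, the following NumPy sketch (not taken from the paper; the function name tc_matmul is illustrative) mimics the arithmetic of a TensorCore-style fused multiply-add: inputs are rounded to fp16 while products are accumulated in fp32, so even a plain matrix multiply loses roughly three decimal digits, and any entry whose magnitude exceeds the fp16 maximum overflows.

```python
# Minimal sketch of TensorCore-style mixed-precision GEMM, simulated in NumPy.
# `tc_matmul` is an illustrative name, not an API from the paper or cuBLAS.
import numpy as np

def tc_matmul(A, B):
    """Multiply A @ B with fp16 inputs and fp32 accumulation."""
    A16 = A.astype(np.float16)                      # input rounding, ~2^-11 relative
    B16 = B.astype(np.float16)
    # accumulate the products in fp32, as TensorCores do for D = A*B + C
    return A16.astype(np.float32) @ B16.astype(np.float32)

rng = np.random.default_rng(0)
A = rng.standard_normal((256, 256))
B = rng.standard_normal((256, 256))
err = np.linalg.norm(tc_matmul(A, B) - A @ B) / np.linalg.norm(A @ B)
print(f"relative error of fp16-input GEMM: {err:.2e}")   # roughly 1e-3

# fp16 also has a narrow range; larger magnitudes overflow to inf,
# which is why scaling/safeguarding is needed before data reaches the engine.
print("fp16 max:", np.finfo(np.float16).max)              # 65504.0
```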

We demonstrate that neural engines can be used effectively to accelerate matrix computations (a 3.0x-14.6x QR speedup over cuSOLVER, reaching up to 36.6 TFLOPS); however, different algorithms (recursive Gram-Schmidt) are needed to expose more locality and parallelism, even at the cost of additional computation. Moreover, scaling, iterative refinement, and other safeguarding procedures are needed to regain accuracy and avoid overflow. Our experience suggests that, at present, matrix factorizations (QR, LU, Cholesky) on neural engines are best co-designed with their applications (linear solvers, least squares, orthogonalization, SVD, etc.) to achieve high performance together with adequate accuracy and reliability.
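As a rough illustration of this algorithmic direction, the sketch below (a hedged reconstruction, not the authors' implementation; the helper names low_precision_gemm and block_cgs2_qr are invented for this example) builds a GEMM-rich QR from block classical Gram-Schmidt with one reorthogonalization pass (CGS2), scaling each panel so its fp16 representation cannot overflow.

```python
# Hedged sketch of a GEMM-rich QR: block classical Gram-Schmidt with
# reorthogonalization (CGS2) and panel scaling.  Illustrative only.
import numpy as np

def low_precision_gemm(A, B):
    """Stand-in for a neural-engine GEMM: fp16 inputs, fp32 accumulation."""
    return A.astype(np.float16).astype(np.float32) @ \
           B.astype(np.float16).astype(np.float32)

def block_cgs2_qr(A, nb=32):
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(0, n, nb):
        jb = slice(j, min(j + nb, n))
        V = A[:, jb].copy()
        # scale the panel so its entries fit comfortably in fp16 range
        s = np.max(np.abs(V))
        s = s if s > 0 else 1.0
        V /= s
        if j > 0:
            # project out previously computed Q twice (CGS2); both projections
            # are tall GEMMs, the operations a neural engine executes fast
            for _ in range(2):
                S = low_precision_gemm(Q[:, :j].T, V)
                V -= low_precision_gemm(Q[:, :j], S)
                R[:j, jb] += s * S
        # small in-panel factorization stays in higher precision
        Qp, Rp = np.linalg.qr(V)
        Q[:, jb] = Qp
        R[jb, jb] = s * Rp
    return Q, R

A = np.random.default_rng(1).standard_normal((1024, 256))
Q, R = block_cgs2_qr(A)
print("orthogonality:", np.linalg.norm(Q.T @ Q - np.eye(Q.shape[1])))
print("residual     :", np.linalg.norm(A - Q @ R) / np.linalg.norm(A))
```

The point of this structure is that the bulk of the flops land in the two tall GEMMs per panel, which map onto the low-precision engine, while the small in-panel factorization and the accumulation of R stay in higher precision; further accuracy would then be recovered by refinement steps in the target application.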


Supplemental Material

3369583.3392685.mp4 (MP4, 379.4 MB)

