research-article
Open Access

Enabling Highly Efficient Batched Matrix Multiplications on SW26010 Many-core Processor

Published: 04 March 2020

Abstract

We present a systematic methodology for optimizing batched matrix multiplications on the SW26010 many-core processor of the Sunway TaihuLight supercomputer. We propose five surrogate algorithms and a machine learning–based algorithm selector to fully exploit the computing capability of the SW26010 and to cope with the sophisticated algorithm characteristics of batched matrix multiplications. Experimental results show that the algorithm selector adaptively chooses the appropriate algorithm for various matrix shapes and batch sizes with low overhead and high accuracy. In particular, the optimized batched matrix multiplications substantially outperform the non-batched version and reach around 84.8% of the performance upper bound.
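For readers unfamiliar with the operation, the following is a minimal reference sketch of what a batched matrix multiplication computes: a set of independent products C[i] = A[i] · B[i]. It only illustrates the semantics in plain Python. It is not one of the five algorithms from the article and says nothing about how work is mapped onto the SW26010; the function names `matmul` and `batched_matmul` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain triple-loop product of an m x k matrix and a k x n matrix."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    return [
        [sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
        for i in range(m)
    ]


def batched_matmul(batch_a: List[Matrix], batch_b: List[Matrix]) -> List[Matrix]:
    """Reference semantics of a batched product: one independent GEMM per pair.

    A non-batched approach would issue one library call per pair; a batched
    routine receives the whole batch at once, so the implementation is free
    to schedule the small products together.
    """
    if len(batch_a) != len(batch_b):
        raise ValueError("both batches must contain the same number of matrices")
    return [matmul(a, b) for a, b in zip(batch_a, batch_b)]


# Usage: a batch of two 2 x 2 products.
batch_a = [[[1, 2], [3, 4]], [[1, 0], [0, 1]]]
batch_b = [[[5, 6], [7, 8]], [[9, 10], [11, 12]]]
result = batched_matmul(batch_a, batch_b)
```

Because every product in the batch is independent, the matrix shapes and the batch size together determine how much parallel work is available, which is the setting in which the abstract's algorithm selector operates.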

      • Published in

        ACM Transactions on Architecture and Code Optimization, Volume 17, Issue 1
        March 2020
        206 pages
        ISSN:1544-3566
        EISSN:1544-3973
        DOI:10.1145/3386454

        Copyright © 2020 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 4 March 2020
        • Accepted: 1 January 2020
        • Revised: 1 October 2019
        • Received: 1 August 2019

        Qualifiers

        • research-article
        • Research
        • Refereed
