Abstract
We present a systematic methodology for optimizing batched matrix multiplications on the SW26010 many-core processor of the Sunway TaihuLight supercomputer. We propose five surrogate algorithms and a machine learning-based algorithm selector to fully exploit the computing capability of the SW26010 and to cope with the complex algorithmic characteristics of batched matrix multiplications. Experimental results show that the algorithm selector adaptively chooses the appropriate algorithm for various matrix shapes and batch sizes with low overhead and high accuracy. In particular, the optimized batched matrix multiplications substantially outperform the non-batched version and reach around 84.8% of the performance upper bound.
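To make the setting concrete, the sketch below shows a fixed-size batched DGEMM in plain C together with a stand-in algorithm selector. The function names (`select_algorithm`, `dgemm_batch`), the surrogate-algorithm IDs, and the size-threshold heuristic are all illustrative assumptions, not the paper's actual API or trained model; a tuned SW26010 implementation would dispatch among the five surrogate algorithms and execute the chosen kernel on the compute processing elements (CPEs) with DMA tiling.

```c
/* Hypothetical sketch of a fixed-size batched DGEMM with a stand-in
 * algorithm selector. All names and thresholds here are illustrative
 * assumptions, not the paper's interface or trained model. */

typedef enum {                 /* illustrative surrogate-algorithm IDs */
    ALG_SMALL_TILED,
    ALG_LARGE_BLOCKED,
    ALG_FUSED_BATCH,
    ALG_FALLBACK
} batched_gemm_alg_t;

/* The paper's selector maps (m, n, k, batch) to one of five surrogate
 * algorithms with a trained model; this size-threshold rule is only a
 * placeholder for that machine-learning model. */
batched_gemm_alg_t select_algorithm(int m, int n, int k, int batch)
{
    if ((long)m * n * k <= 32L * 32 * 32)
        return (batch >= 64) ? ALG_FUSED_BATCH : ALG_SMALL_TILED;
    return ALG_LARGE_BLOCKED;
}

/* Fixed-size batch: C[b] = alpha * A[b] * B[b] + beta * C[b] for each
 * of the `batch` independent row-major m-by-k and k-by-n operands. */
void dgemm_batch(int m, int n, int k, double alpha,
                 const double *const *A, const double *const *B,
                 double beta, double *const *C, int batch)
{
    /* A real SW26010 version would branch on `alg` and offload the
     * chosen kernel to the CPE cluster; the reference triple loop
     * below is used unconditionally here. */
    batched_gemm_alg_t alg = select_algorithm(m, n, k, batch);
    (void)alg;

    for (int b = 0; b < batch; ++b)
        for (int i = 0; i < m; ++i)
            for (int j = 0; j < n; ++j) {
                double acc = 0.0;
                for (int p = 0; p < k; ++p)
                    acc += A[b][i * k + p] * B[b][p * n + j];
                C[b][i * n + j] = alpha * acc + beta * C[b][i * n + j];
            }
}
```

In practice, a strided layout (one base pointer plus a fixed per-matrix stride) is often preferred over the pointer-array arguments shown here, since contiguous, regularly spaced operands simplify DMA of tiles into each CPE's local store.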