A Computational Model for Tensor Core Units

ABSTRACT
To meet the need for efficient training and inference of deep neural networks, a plethora of domain-specific hardware architectures have been introduced, such as Google Tensor Processing Units and NVIDIA Tensor Cores. A common feature of these architectures is hardware support for efficiently computing a dense matrix product of a fixed small size. To broaden the class of algorithms that can exploit such systems, we propose a computational model, named the TCU model, which captures the ability to natively multiply small matrices. We then use the TCU model to design fast algorithms for several problems, including dense and sparse matrix multiplication and the Discrete Fourier Transform. Finally, we highlight a relation between the TCU model and the external memory model.
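To make the model's core primitive concrete, the following sketch treats the native small-matrix product as a black box and builds a larger matrix multiplication on top of it by tiling. All names here (`tcu_multiply`, `tiled_matmul`, the tile size `s`) are illustrative assumptions, not notation from the paper; the point is only that the outer algorithm issues work exclusively in units of s×s products, as a TCU-model algorithm would.

```python
# Illustrative sketch of the TCU model's core primitive: the hardware
# natively multiplies small s x s matrices, and larger products are
# assembled from such tile products. Names are hypothetical.

def tcu_multiply(A, B, s):
    """Stand-in for the native s x s matrix-multiply unit."""
    C = [[0] * s for _ in range(s)]
    for i in range(s):
        for k in range(s):
            a = A[i][k]
            for j in range(s):
                C[i][j] += a * B[k][j]
    return C

def tiled_matmul(A, B, n, s):
    """Multiply two n x n matrices using only s x s native products.

    Assumes n is divisible by s for simplicity.
    """
    C = [[0] * n for _ in range(n)]
    for bi in range(0, n, s):
        for bj in range(0, n, s):
            for bk in range(0, n, s):
                # Extract s x s tiles and hand them to the native unit.
                At = [[A[bi + i][bk + k] for k in range(s)] for i in range(s)]
                Bt = [[B[bk + k][bj + j] for j in range(s)] for k in range(s)]
                Ct = tcu_multiply(At, Bt, s)
                for i in range(s):
                    for j in range(s):
                        C[bi + i][bj + j] += Ct[i][j]
    return C
```

In the TCU model the cost of an algorithm is driven by how many such tile products it issues and how data moves in and out of them, which is what connects it to the external memory model mentioned above.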