ABSTRACT
Fast matrix multiplication algorithms, both sequential and parallel, switch to the classical cubic-time algorithm on small sub-blocks, where the classical algorithm requires fewer operations. We obtain a new algorithm that outperforms the classical one even on small blocks, by trading multiplications for additions. This algorithm contradicts the common belief that the classical algorithm is the fastest algorithm for small blocks. To this end, we introduce commutative algorithms that generalize Winograd's folding technique (1968) and combine it with fast matrix multiplication algorithms. Thus, when a single scalar multiplication requires ρ times more clock cycles than an addition (e.g., for 16-bit integers on Intel's Skylake microarchitecture, ρ is between 1.5 and 5), our technique reduces the computation cost of multiplying the small sub-blocks by a factor of (ρ + 3)/(2(ρ + 1)) compared to using the classical algorithm, at the price of a low-order-term communication cost overhead in both the sequential and the parallel cases, thus reducing the total runtime of the algorithm. Our technique also reduces the energy cost of the algorithm; the ρ values for energy costs are typically larger than the ρ values for arithmetic costs. For example, we obtain an algorithm for multiplying 2 × 2 blocks using only four multiplications. This algorithm seemingly contradicts the lower bound of Winograd (1971) on multiplying 2 × 2 matrices. However, we obtain this algorithm by bypassing the implicit assumptions of the lower bound. We provide a new lower bound matching our algorithm for 2 × 2 block multiplication, thus showing our technique is optimal.
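The folding technique the abstract builds on (Winograd, 1968) computes each inner product of even length m with only m/2 multiplications, after precomputing one term per row of A and one per column of B, and relies on commutativity of scalar multiplication. The following is a minimal Python sketch, not code from the paper; the function name and the multiplication counter are illustrative.

```python
# Winograd's 1968 inner-product folding: trades roughly half of the
# scalar multiplications in a matrix product for extra additions.
# Identity per entry (m even, indices paired as 2j, 2j+1):
#   C[i][k] = sum_j (A[i][2j] + B[2j+1][k]) * (A[i][2j+1] + B[2j][k]) - xi_i - eta_k
# where xi_i = sum_j A[i][2j]*A[i][2j+1] and eta_k = sum_j B[2j][k]*B[2j+1][k].

def winograd_matmul(A, B):
    """Multiply A (n x m) by B (m x p), m even, counting scalar multiplications."""
    n, m, p = len(A), len(A[0]), len(B[0])
    assert m % 2 == 0, "folding needs an even inner dimension"
    mults = 0

    # Row terms xi_i: m/2 multiplications per row of A.
    xi = []
    for i in range(n):
        s = 0
        for j in range(0, m, 2):
            s += A[i][j] * A[i][j + 1]
            mults += 1
        xi.append(s)

    # Column terms eta_k: m/2 multiplications per column of B.
    eta = []
    for k in range(p):
        s = 0
        for j in range(0, m, 2):
            s += B[j][k] * B[j + 1][k]
            mults += 1
        eta.append(s)

    # Each of the n*p output entries now costs only m/2 multiplications.
    C = [[0] * p for _ in range(n)]
    for i in range(n):
        for k in range(p):
            s = 0
            for j in range(0, m, 2):
                s += (A[i][j] + B[j + 1][k]) * (A[i][j + 1] + B[j][k])
                mults += 1
            C[i][k] = s - xi[i] - eta[k]
    return C, mults
```

For n × n matrices this uses n³/2 + n² multiplications instead of n³ (e.g., 48 instead of 64 for n = 4); when the same block rows and columns take part in many block products, the precomputed terms amortize further, which is in the spirit of the per-block savings the abstract describes.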
REFERENCES
- Josh Alman and Virginia V. Williams. 2021. A refined laser method and faster matrix multiplication. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA). 522--539. https://doi.org/10.1137/1.9781611976465.32
- N. Anderson and D. Manley. 1994. A matrix extension of Winograd's inner product algorithm. Theoretical Computer Science 131, 2 (1994). https://doi.org/10.1016/0304-3975(94)90186-4
- Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, and Oded Schwartz. 2012. Communication-optimal parallel algorithm for Strassen's matrix multiplication. In Proceedings of the 24th Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 193--204. https://doi.org/10.1145/2312005.2312044
- Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. 2011. Graph expansion and communication costs of fast matrix multiplication. In Proceedings of the 23rd Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). https://doi.org/10.1145/1989493.1989495
- Gal Beniamini, Nathan Cheng, Olga Holtz, Elaye Karstadt, and Oded Schwartz. 2020. Sparsifying the operators of fast matrix multiplication algorithms. arXiv preprint arXiv:2008.03759 (2020).
- Gal Beniamini and Oded Schwartz. 2019. Faster matrix multiplication via sparse decomposition. In Proceedings of the 31st Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 11--22. https://doi.org/10.1145/3323165.3323188
- Austin R. Benson and Grey Ballard. 2015. A framework for practical parallel fast matrix multiplication. ACM SIGPLAN Notices 50, 8 (2015). https://doi.org/10.1145/2858788.2688513
- Dario A. Bini. 1980. Relations between exact and approximate bilinear algorithms. Applications. Calcolo 17, 1 (1980). https://doi.org/10.1007/BF02575865
- Markus Bläser. 1999. Lower bounds for the multiplicative complexity of matrix multiplication. Computational Complexity 8, 3 (1999). https://doi.org/10.1007/s000370050028
- Markus Bläser. 2001. A 5/2 n^2-lower bound for the multiplicative complexity of n × n-matrix multiplication. In STACS 2001 (Lecture Notes in Computer Science, Vol. 2010). https://doi.org/10.1007/3-540-44693-1_9
- Richard P. Brent. 1970. Algorithms for matrix multiplication. Technical Report. Department of Computer Science, Stanford University, CA.
- Richard P. Brent. 1970. Error analysis of algorithms for matrix multiplication and triangular decomposition using Winograd's identity. Numer. Math. 16, 2 (1970). https://doi.org/10.1007/BF02308867
- Nader H. Bshouty. 1995. On the additive complexity of 2 × 2 matrix multiplication. Inform. Process. Lett. 56, 6 (1995), 329--335. https://doi.org/10.1016/0020-0190(95)00176-X
- Murat Cenk and M. Anwar Hasan. 2017. On the arithmetic complexity of Strassen-like matrix multiplications. Journal of Symbolic Computation 80 (2017), 484--501. https://doi.org/10.1016/j.jsc.2016.07.004
- Henry Cohn and Christopher Umans. 2003. A group-theoretic approach to fast matrix multiplication. In Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS). IEEE, 438--449.
- Don Coppersmith and Shmuel Winograd. 1990. Matrix multiplication via arithmetic progressions. Journal of Symbolic Computation 9, 3 (1990). https://doi.org/10.1016/S0747-7171(08)80013-2
- Alexander M. Davie and Andrew J. Stothers. 2013. Improved bound for complexity of matrix multiplication. Proceedings of the Royal Society of Edinburgh Section A: Mathematics 143, 2 (2013). https://doi.org/10.1017/S0308210511001648
- Hans F. de Groote. 1987. Lectures on the Complexity of Bilinear Problems. Lecture Notes in Computer Science, Vol. 245. Springer.
- James Demmel, David Eliahu, Armando Fox, Shoaib Kamil, Benjamin Lipshitz, Oded Schwartz, and Omer Spillinger. 2013. Communication-optimal parallel recursive rectangular matrix multiplication. In Proceedings of the IEEE 27th International Parallel and Distributed Processing Symposium (IPDPS). https://doi.org/10.1109/IPDPS.2013.80
- Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Francisco J. R. Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, et al. 2022. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature 610, 7930 (2022), 47--53.
- Patrick C. Fischer. 1974. Further schemes for combining matrix algorithms. In International Colloquium on Automata, Languages, and Programming (ICALP). Springer, 428--436.
- Agner Fog. 2022. Instruction tables. Technical University of Denmark (2022). https://www.agner.org/optimize/instruction_tables.pdf
- François Le Gall. 2014. Powers of tensors and fast matrix multiplication. In Proceedings of the 39th International Symposium on Symbolic and Algebraic Computation (ISSAC). 296--303. https://doi.org/10.1145/2608628.2608664
- Marijn J. H. Heule, Manuel Kauers, and Martina Seidl. 2021. New ways to multiply 3 × 3-matrices. Journal of Symbolic Computation 104 (2021). https://doi.org/10.1016/j.jsc.2020.10.003
- John E. Hopcroft and Leslie R. Kerr. 1971. On minimizing the number of multiplications necessary for matrix multiplication. SIAM J. Appl. Math. 20, 1 (1971), 30--36.
- Joseph JáJá. 1980. On the complexity of bilinear forms with commutativity. SIAM J. Comput. 9, 4 (1980). https://doi.org/10.1137/0209056
- Larisa D. Jelfimova. 2019. A new fast recursive matrix multiplication algorithm. Cybernetics and Systems Analysis 55, 4 (2019), 547--551. https://doi.org/10.1007/s10559-019-00163-2
- Larisa D. Jelfimova. 2021. A fast recursive algorithm for multiplying matrices of order n = 3^q (q > 1). Cybernetics and Systems Analysis 57, 2 (2021). https://doi.org/10.1007/s10559-021-00345-x
- Hong Jia-Wei and Hsiang-Tsung Kung. 1981. I/O complexity: The red-blue pebble game. In Proceedings of the 13th Annual ACM Symposium on Theory of Computing (STOC). 326--333.
- Norman P. Jouppi, Doe Hyun Yoon, Matthew Ashcraft, Mark Gottscho, Thomas B. Jablin, George Kurian, James Laudon, Sheng Li, Peter Ma, Xiaoyu Ma, Thomas Norrie, Nishant Patil, Sushma Prasad, Cliff Young, Zongwei Zhou, and David Patterson. 2021. Ten lessons from three generations shaped Google's TPUv4i: Industrial product. In Proceedings of the 48th Annual International Symposium on Computer Architecture (ISCA). https://doi.org/10.1109/ISCA52012.2021.00010
- Elaye Karstadt and Oded Schwartz. 2020. Matrix multiplication, a little faster. Journal of the ACM 67, 1 (2020), 1--31.
- Grzegorz Kwasniewski, Marko Kabić, Maciej Besta, Joost VandeVondele, Raffaele Solcà, and Torsten Hoefler. 2019. Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC). 1--22.
- Julian D. Laderman. 1976. A noncommutative algorithm for multiplying 3 × 3 matrices using 23 multiplications. Bulletin of the American Mathematical Society 82 (1976), 126--128.
- Benjamin Lipshitz, Grey Ballard, James Demmel, and Oded Schwartz. 2012. Communication-avoiding parallel Strassen: Implementation and performance. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC). https://doi.org/10.1109/SC.2012.33
- Charles F. Van Loan. 2000. The ubiquitous Kronecker product. Journal of Computational and Applied Mathematics 123, 1--2 (2000), 85--100. https://doi.org/10.1016/S0377-0427(00)00393-9
- Roy Nissim and Oded Schwartz. 2019. Revisiting the I/O-complexity of fast matrix multiplication with recomputations. In Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 482--490.
- Victor Y. Pan. 1980. New fast algorithms for matrix operations. SIAM J. Comput. 9, 2 (1980). https://doi.org/10.1137/0209027
- Victor Y. Pan. 1982. Trilinear aggregating with implicit canceling for a new acceleration of matrix multiplication. Computers and Mathematics with Applications 8, 1 (1982). https://doi.org/10.1016/0898-1221(82)90037-2
- Robert L. Probert. 1976. On the additive complexity of matrix multiplication. SIAM J. Comput. 5, 2 (1976), 187--203.
- Andreas Rosowski. 2019. Fast commutative matrix algorithm. arXiv preprint arXiv:1904.07683 (2019). http://arxiv.org/abs/1904.07683
- Arnold Schönhage. 1981. Partial and total matrix multiplication. SIAM J. Comput. 10, 3 (1981). https://doi.org/10.1137/0210032
- Jacob Scott, Olga Holtz, and Oded Schwartz. 2015. Matrix multiplication I/O-complexity by path routing. In Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 35--45.
- Alexey V. Smirnov. 2013. The bilinear complexity and practical algorithms for matrix multiplication. Computational Mathematics and Mathematical Physics 53, 12 (2013). https://doi.org/10.1134/S0965542513120129
- Lorenzo De Stefani. 2019. The I/O complexity of hybrid algorithms for square matrix multiplication. In 30th International Symposium on Algorithms and Computation (ISAAC 2019) (LIPIcs, Vol. 149), Pinyan Lu and Guochuan Zhang (Eds.). Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 33:1--33:16. https://doi.org/10.4230/LIPIcs.ISAAC.2019.33
- Volker Strassen. 1969. Gaussian elimination is not optimal. Numer. Math. 13, 4 (1969). https://doi.org/10.1007/BF02165411
- Volker Strassen. 1973. Vermeidung von Divisionen [Avoiding divisions]. Journal für die reine und angewandte Mathematik 264 (1973). https://doi.org/10.1515/crll.1973.264.184
- Abraham Waksman. 1970. On Winograd's algorithm for inner products. IEEE Trans. Comput. C-19, 4 (1970). https://doi.org/10.1109/T-C.1970.222926
- Virginia V. Williams. 2012. Multiplying matrices faster than Coppersmith-Winograd. In Proceedings of the 44th Annual ACM Symposium on Theory of Computing (STOC). 887--898. https://doi.org/10.1145/2213977.2214056
- Shmuel Winograd. 1968. A new algorithm for inner product. IEEE Trans. Comput. C-17, 7 (1968), 693--694. https://doi.org/10.1109/TC.1968.227420
- Shmuel Winograd. 1968. On the number of multiplications necessary to compute certain functions. Communications on Pure and Applied Mathematics 23, 2 (1968). https://doi.org/10.1002/cpa.3160230204
- Shmuel Winograd. 1971. On multiplication of 2 × 2 matrices. Linear Algebra and Its Applications 4, 4 (1971), 381--388. https://doi.org/10.1016/0024-3795(71)90009-7
- Shmuel Winograd. 1976. Private communication with R. Probert.
- Cui Qing Yang and Barton P. Miller. 1988. Critical path analysis for the execution of parallel and distributed programs. In Proceedings of the 8th International Conference on Distributed Computing Systems (ICDCS). https://doi.org/10.1109/dcs.1988.12538
Index Terms
- Multiplying 2 × 2 Sub-Blocks Using 4 Multiplications