ABSTRACT
Fast matrix multiplication algorithms, both sequential and parallel, switch to the classical cubic-time algorithm on small sub-blocks, where the classical algorithm requires fewer operations. We obtain a new algorithm that outperforms the classical one even on small blocks, by trading multiplications for additions. This algorithm contradicts the common belief that the classical algorithm is the fastest algorithm for small blocks. To this end, we introduce commutative algorithms that generalize Winograd's folding technique (1968) and combine it with fast matrix multiplication algorithms. Thus, when a single scalar multiplication requires ρ times more clock cycles than an addition (e.g., for 16-bit integers on Intel's Skylake microarchitecture, ρ is between 1.5 and 5), our technique reduces the computation cost of multiplying the small sub-blocks by a factor of (ρ + 3)/(2(ρ + 1)) compared to using the classical algorithm, at the price of a low-order-term communication cost overhead in both the sequential and the parallel cases, thus reducing the total runtime of the algorithm. Our technique also reduces the energy cost of the algorithm; the ρ values for energy costs are typically larger than the ρ values for arithmetic costs. For example, we obtain an algorithm for multiplying 2 × 2 blocks using only four multiplications. This algorithm seemingly contradicts the lower bound of Winograd (1971) on multiplying 2 × 2 matrices. However, we obtain this algorithm by bypassing the implicit assumptions of the lower bound. We provide a new lower bound matching our algorithm for 2 × 2 block multiplication, thus showing our technique is optimal.
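The folding technique the abstract builds on (Winograd, 1968) computes each inner product of even length m with only m/2 multiplications, after precomputing one term per row of A and one per column of B, and relies on commutativity of scalar multiplication. The following is a minimal Python sketch, not code from the paper; the function name and the multiplication counter are illustrative.

```python
# Winograd's 1968 inner-product folding: trades roughly half of the
# scalar multiplications in a matrix product for extra additions.
# Identity per entry (m even, indices paired as 2j, 2j+1):
#   C[i][k] = sum_j (A[i][2j] + B[2j+1][k]) * (A[i][2j+1] + B[2j][k]) - xi_i - eta_k
# where xi_i = sum_j A[i][2j]*A[i][2j+1] and eta_k = sum_j B[2j][k]*B[2j+1][k].

def winograd_matmul(A, B):
    """Multiply A (n x m) by B (m x p), m even, counting scalar multiplications."""
    n, m, p = len(A), len(A[0]), len(B[0])
    assert m % 2 == 0, "folding needs an even inner dimension"
    mults = 0

    # Row terms xi_i: m/2 multiplications per row of A.
    xi = []
    for i in range(n):
        s = 0
        for j in range(0, m, 2):
            s += A[i][j] * A[i][j + 1]
            mults += 1
        xi.append(s)

    # Column terms eta_k: m/2 multiplications per column of B.
    eta = []
    for k in range(p):
        s = 0
        for j in range(0, m, 2):
            s += B[j][k] * B[j + 1][k]
            mults += 1
        eta.append(s)

    # Each of the n*p output entries now costs only m/2 multiplications.
    C = [[0] * p for _ in range(n)]
    for i in range(n):
        for k in range(p):
            s = 0
            for j in range(0, m, 2):
                s += (A[i][j] + B[j + 1][k]) * (A[i][j + 1] + B[j][k])
                mults += 1
            C[i][k] = s - xi[i] - eta[k]
    return C, mults
```

For n × n matrices this uses n³/2 + n² multiplications instead of n³ (e.g., 48 instead of 64 for n = 4); when the same block rows and columns take part in many block products, the precomputed terms amortize further, which is in the spirit of the per-block savings the abstract describes.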
REFERENCES
- Josh Alman and Virginia V. Williams. 2021. A refined laser method and faster matrix multiplication. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA). 522--539. https://doi.org/10.1137/1.9781611976465.32
- N. Anderson and D. Manley. 1994. A matrix extension of Winograd's inner product algorithm. Theoretical Computer Science 131, 2 (1994). https://doi.org/10.1016/0304-3975(94)90186-4
- Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, and Oded Schwartz. 2012. Communication-optimal parallel algorithm for Strassen's matrix multiplication. In Proceedings of the 24th Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 193--204. https://doi.org/10.1145/2312005.2312044
- Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. 2011. Graph expansion and communication costs of fast matrix multiplication. In Proceedings of the 23rd Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). https://doi.org/10.1145/1989493.1989495
- Gal Beniamini, Nathan Cheng, Olga Holtz, Elaye Karstadt, and Oded Schwartz. 2020. Sparsifying the operators of fast matrix multiplication algorithms. arXiv preprint arXiv:2008.03759 (2020).
- Gal Beniamini and Oded Schwartz. 2019. Faster matrix multiplication via sparse decomposition. In Proceedings of the 31st Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 11--22. https://doi.org/10.1145/3323165.3323188
- Austin R. Benson and Grey Ballard. 2015. A framework for practical parallel fast matrix multiplication. ACM SIGPLAN Notices 50, 8 (2015). https://doi.org/10.1145/2858788.2688513
- Dario A. Bini. 1980. Relations between exact and approximate bilinear algorithms. Applications. Calcolo 17, 1 (1980). https://doi.org/10.1007/BF02575865
- Markus Bläser. 1999. Lower bounds for the multiplicative complexity of matrix multiplication. Computational Complexity 8, 3 (1999). https://doi.org/10.1007/s000370050028
- Markus Bläser. 2001. A 5/2 n^2-lower bound for the multiplicative complexity of n × n-matrix multiplication. In STACS 2001 (Lecture Notes in Computer Science, Vol. 2010). https://doi.org/10.1007/3-540-44693-1_9
- Richard P. Brent. 1970. Algorithms for matrix multiplication. Technical Report. Department of Computer Science, Stanford University, CA.
- Richard P. Brent. 1970. Error analysis of algorithms for matrix multiplication and triangular decomposition using Winograd's identity. Numer. Math. 16, 2 (1970). https://doi.org/10.1007/BF02308867
- Nader H. Bshouty. 1995. On the additive complexity of 2 × 2 matrix multiplication. Inform. Process. Lett. 56, 6 (1995), 329--335. https://doi.org/10.1016/0020-0190(95)00176-X
- Murat Cenk and M. Anwar Hasan. 2017. On the arithmetic complexity of Strassen-like matrix multiplications. Journal of Symbolic Computation 80 (2017), 484--501. https://doi.org/10.1016/j.jsc.2016.07.004
- Henry Cohn and Christopher Umans. 2003. A group-theoretic approach to fast matrix multiplication. In Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS). IEEE, 438--449.
- Don Coppersmith and Shmuel Winograd. 1990. Matrix multiplication via arithmetic progressions. Journal of Symbolic Computation 9, 3 (1990). https://doi.org/10.1016/S0747-7171(08)80013-2
- Alexander M. Davie and Andrew J. Stothers. 2013. Improved bound for complexity of matrix multiplication. Proceedings of the Royal Society of Edinburgh Section A: Mathematics 143, 2 (2013). https://doi.org/10.1017/S0308210511001648
- Hans F. de Groote. 1987. Lectures on the Complexity of Bilinear Problems. Lecture Notes in Computer Science, Vol. 245. Springer.
- James Demmel, David Eliahu, Armando Fox, Shoaib Kamil, Benjamin Lipshitz, Oded Schwartz, and Omer Spillinger. 2013. Communication-optimal parallel recursive rectangular matrix multiplication. In Proceedings of the IEEE 27th International Parallel and Distributed Processing Symposium (IPDPS). https://doi.org/10.1109/IPDPS.2013.80
- Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Francisco J. R. Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, et al. 2022. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature 610, 7930 (2022), 47--53.
- Patrick C. Fischer. 1974. Further schemes for combining matrix algorithms. In International Colloquium on Automata, Languages, and Programming (ICALP). Springer, 428--436.
- Agner Fog. 2022. Instruction tables. Technical University of Denmark (2022). https://www.agner.org/optimize/instruction_tables.pdf
- François Le Gall. 2014. Powers of tensors and fast matrix multiplication. In Proceedings of the 39th International Symposium on Symbolic and Algebraic Computation (ISSAC). 296--303. https://doi.org/10.1145/2608628.2608664
- Marijn J. H. Heule, Manuel Kauers, and Martina Seidl. 2021. New ways to multiply 3 × 3-matrices. Journal of Symbolic Computation 104 (2021). https://doi.org/10.1016/j.jsc.2020.10.003
- John E. Hopcroft and Leslie R. Kerr. 1971. On minimizing the number of multiplications necessary for matrix multiplication. SIAM J. Appl. Math. 20, 1 (1971), 30--36.
- Joseph JáJá. 1980. On the complexity of bilinear forms with commutativity. SIAM J. Comput. 9, 4 (1980). https://doi.org/10.1137/0209056
- Larisa D. Jelfimova. 2019. A new fast recursive matrix multiplication algorithm. Cybernetics and Systems Analysis 55, 4 (2019), 547--551. https://doi.org/10.1007/s10559-019-00163-2
- Larisa D. Jelfimova. 2021. A fast recursive algorithm for multiplying matrices of order n = 3^q (q > 1). Cybernetics and Systems Analysis 57, 2 (2021). https://doi.org/10.1007/s10559-021-00345-x
- Hong Jia-Wei and Hsiang-Tsung Kung. 1981. I/O complexity: The red-blue pebble game. In Proceedings of the 13th Annual ACM Symposium on Theory of Computing (STOC). 326--333.
- Norman P. Jouppi, Doe Hyun Yoon, Matthew Ashcraft, Mark Gottscho, Thomas B. Jablin, George Kurian, James Laudon, Sheng Li, Peter Ma, Xiaoyu Ma, Thomas Norrie, Nishant Patil, Sushma Prasad, Cliff Young, Zongwei Zhou, and David Patterson. 2021. Ten lessons from three generations shaped Google's TPUv4i: Industrial product. In Proceedings of the 48th Annual International Symposium on Computer Architecture (ISCA). https://doi.org/10.1109/ISCA52012.2021.00010
- Elaye Karstadt and Oded Schwartz. 2020. Matrix multiplication, a little faster. Journal of the ACM 67, 1 (2020), 1--31.
- Grzegorz Kwasniewski, Marko Kabić, Maciej Besta, Joost VandeVondele, Raffaele Solcà, and Torsten Hoefler. 2019. Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC). 1--22.
- Julian D. Laderman. 1976. A noncommutative algorithm for multiplying 3 × 3 matrices using 23 multiplications. Bulletin of the American Mathematical Society 82 (1976), 126--128.
- Benjamin Lipshitz, Grey Ballard, James Demmel, and Oded Schwartz. 2012. Communication-avoiding parallel Strassen: Implementation and performance. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC). https://doi.org/10.1109/SC.2012.33
- Charles F. Van Loan. 2000. The ubiquitous Kronecker product. Journal of Computational and Applied Mathematics 123, 1--2 (2000), 85--100. https://doi.org/10.1016/S0377-0427(00)00393-9
- Roy Nissim and Oded Schwartz. 2019. Revisiting the I/O-complexity of fast matrix multiplication with recomputations. In Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 482--490.
- Victor Y. Pan. 1980. New fast algorithms for matrix operations. SIAM J. Comput. 9, 2 (1980). https://doi.org/10.1137/0209027
- Victor Y. Pan. 1982. Trilinear aggregating with implicit canceling for a new acceleration of matrix multiplication. Computers and Mathematics with Applications 8, 1 (1982). https://doi.org/10.1016/0898-1221(82)90037-2
- Robert L. Probert. 1976. On the additive complexity of matrix multiplication. SIAM J. Comput. 5, 2 (1976), 187--203.
- Andreas Rosowski. 2019. Fast commutative matrix algorithm. arXiv preprint arXiv:1904.07683 (2019). http://arxiv.org/abs/1904.07683
- Arnold Schönhage. 1981. Partial and total matrix multiplication. SIAM J. Comput. 10, 3 (1981). https://doi.org/10.1137/0210032
- Jacob Scott, Olga Holtz, and Oded Schwartz. 2015. Matrix multiplication I/O-complexity by path routing. In Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 35--45.
- Alexey V. Smirnov. 2013. The bilinear complexity and practical algorithms for matrix multiplication. Computational Mathematics and Mathematical Physics 53, 12 (2013). https://doi.org/10.1134/S0965542513120129
- Lorenzo De Stefani. 2019. The I/O complexity of hybrid algorithms for square matrix multiplication. In 30th International Symposium on Algorithms and Computation (ISAAC 2019) (LIPIcs, Vol. 149), Pinyan Lu and Guochuan Zhang (Eds.). Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 33:1--33:16. https://doi.org/10.4230/LIPIcs.ISAAC.2019.33
- Volker Strassen. 1969. Gaussian elimination is not optimal. Numer. Math. 13, 4 (1969). https://doi.org/10.1007/BF02165411
- Volker Strassen. 1973. Vermeidung von Divisionen [Avoiding divisions]. Journal für die reine und angewandte Mathematik 264 (1973). https://doi.org/10.1515/crll.1973.264.184
- Abraham Waksman. 1970. On Winograd's algorithm for inner products. IEEE Trans. Comput. C-19, 4 (1970). https://doi.org/10.1109/T-C.1970.222926
- Virginia V. Williams. 2012. Multiplying matrices faster than Coppersmith-Winograd. In Proceedings of the 44th Annual ACM Symposium on Theory of Computing (STOC). 887--898. https://doi.org/10.1145/2213977.2214056
- Shmuel Winograd. 1968. A new algorithm for inner product. IEEE Trans. Comput. C-17, 7 (1968), 693--694. https://doi.org/10.1109/TC.1968.227420
- Shmuel Winograd. 1968. On the number of multiplications necessary to compute certain functions. Communications on Pure and Applied Mathematics 23, 2 (1968). https://doi.org/10.1002/cpa.3160230204
- Shmuel Winograd. 1971. On multiplication of 2 × 2 matrices. Linear Algebra and Its Applications 4, 4 (1971), 381--388. https://doi.org/10.1016/0024-3795(71)90009-7
- Shmuel Winograd. 1976. Private communication with R. Probert.
- Cui Qing Yang and Barton P. Miller. 1988. Critical path analysis for the execution of parallel and distributed programs. In Proceedings of the 8th International Conference on Distributed Computing Systems (ICDCS). https://doi.org/10.1109/dcs.1988.12538
Index Terms
- Multiplying 2 × 2 Sub-Blocks Using 4 Multiplications