DOI: 10.1145/3558481.3591083
research-article
Open Access

Multiplying 2 × 2 Sub-Blocks Using 4 Multiplications

Published: 17 June 2023

ABSTRACT

Fast parallel and sequential matrix multiplication algorithms switch to the cubic-time classical algorithm on small sub-blocks, as the classical algorithm requires fewer operations on small blocks. We obtain a new algorithm that can outperform the classical one even on small blocks by trading multiplications for additions. This algorithm contradicts the common belief that the classical algorithm is the fastest one for small blocks. To this end, we introduce commutative algorithms that generalize Winograd's folding technique (1968) and combine it with fast matrix multiplication algorithms. Thus, when a single scalar multiplication requires ρ times more clock cycles than an addition (e.g., for 16-bit integers on Intel's Skylake microarchitecture, ρ is between 1.5 and 5), our technique reduces the computation cost of multiplying the small sub-blocks by a factor of (ρ + 3) / (2(ρ + 1)) compared to using the classical algorithm, at the price of a low-order-term communication overhead in both the sequential and parallel cases, thus reducing the total runtime of the algorithm. Our technique also reduces the energy cost of the algorithm; the ρ values for energy costs are typically larger than those for arithmetic costs. For example, we obtain an algorithm for multiplying 2 × 2 blocks using only four multiplications. This algorithm seemingly contradicts the lower bound of Winograd (1971) on multiplying 2 × 2 matrices; however, we obtain it by bypassing the implicit assumptions of the lower bound. We provide a new lower bound matching our algorithm for 2 × 2 block multiplication, thus showing that our technique is optimal.
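The folding idea the abstract generalizes is Winograd's 1968 inner-product identity, which pairs adjacent coordinates so that half of the multiplications in a dot product are replaced by additions; the two correction sums depend on only one vector each, so in matrix multiplication they are precomputed once per row of A and once per column of B and amortized across the whole product. A minimal sketch in Python (illustrating the classical 1968 identity only, not the paper's new commutative 2 × 2 block algorithm):

```python
def winograd_inner_product(x, y):
    """Inner product of two even-length vectors via Winograd's 1968 folding.

    Uses n/2 multiplications in the folded cross term, plus two
    correction sums (xi, eta) that each depend on a single vector
    and can be precomputed and reused in matrix multiplication.
    """
    assert len(x) == len(y) and len(x) % 2 == 0
    half = len(x) // 2
    # Correction terms: products of adjacent pairs within each vector.
    xi = sum(x[2 * j] * x[2 * j + 1] for j in range(half))
    eta = sum(y[2 * j] * y[2 * j + 1] for j in range(half))
    # Folded cross term: a single multiplication per coordinate pair.
    # (x0 + y1)(x1 + y0) = x0*x1 + x0*y0 + x1*y1 + y0*y1, so subtracting
    # xi and eta leaves exactly x0*y0 + x1*y1.
    cross = sum((x[2 * j] + y[2 * j + 1]) * (x[2 * j + 1] + y[2 * j])
                for j in range(half))
    return cross - xi - eta


if __name__ == "__main__":
    x, y = [1, 2, 3, 4], [5, 6, 7, 8]
    assert winograd_inner_product(x, y) == sum(a * b for a, b in zip(x, y))
    print(winograd_inner_product(x, y))  # 1*5 + 2*6 + 3*7 + 4*8 = 70
```

The trade-off is visible directly: each pair of coordinates costs one multiplication instead of two, at the price of extra additions, which pays off precisely when a multiplication costs ρ > 1 times an addition.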


Supplemental Material

SPAA23-fp88.mp4 (mp4, 39.4 MB)

References

  1. Josh Alman and Virginia V. Williams. 2021. A refined laser method and faster matrix multiplication. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA). 522--539. https://doi.org/10.1137/1.9781611976465.32
  2. N. Anderson and D. Manley. 1994. A matrix extension of Winograd's inner product algorithm. Theoretical Computer Science 131, 2 (1994). https://doi.org/10.1016/0304-3975(94)90186-4
  3. Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, and Oded Schwartz. 2012. Communication-optimal parallel algorithm for Strassen's matrix multiplication. In Proceedings of the 24th Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 193--204. https://doi.org/10.1145/2312005.2312044
  4. Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. 2011. Graph expansion and communication costs of fast matrix multiplication. In Proceedings of the Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). https://doi.org/10.1145/1989493.1989495
  5. Gal Beniamini, Nathan Cheng, Olga Holtz, Elaye Karstadt, and Oded Schwartz. 2020. Sparsifying the Operators of Fast Matrix Multiplication Algorithms. arXiv preprint arXiv:2008.03759 (2020).
  6. Gal Beniamini and Oded Schwartz. 2019. Faster matrix multiplication via sparse decomposition. In Proceedings of the Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 11--22. https://doi.org/10.1145/3323165.3323188
  7. Austin R. Benson and Grey Ballard. 2015. A framework for practical parallel fast matrix multiplication. ACM SIGPLAN Notices 50, 8 (2015). https://doi.org/10.1145/2858788.2688513
  8. Dario A. Bini. 1980. Relations between exact and approximate bilinear algorithms. Applications. Calcolo 17, 1 (1980). https://doi.org/10.1007/BF02575865
  9. Markus Bläser. 1999. Lower bounds for the multiplicative complexity of matrix multiplication. Computational Complexity 8, 3 (1999). https://doi.org/10.1007/s000370050028
  10. Markus Bläser. 2001. A 5/2 n²-lower bound for the multiplicative complexity of n × n-matrix multiplication. Lecture Notes in Computer Science, Vol. 2010. https://doi.org/10.1007/3-540-44693-1_9
  11. Richard P. Brent. 1970. Algorithms for matrix multiplication. Technical Report. Department of Computer Science, Stanford University, CA.
  12. Richard P. Brent. 1970. Error analysis of algorithms for matrix multiplication and triangular decomposition using Winograd's identity. Numerische Mathematik 16, 2 (1970). https://doi.org/10.1007/BF02308867
  13. Nader H. Bshouty. 1995. On the additive complexity of 2 × 2 matrix multiplication. Information Processing Letters 56, 6 (1995), 329--335. https://doi.org/10.1016/0020-0190(95)00176-X
  14. Murat Cenk and M. Anwar Hasan. 2017. On the arithmetic complexity of Strassen-like matrix multiplications. Journal of Symbolic Computation 80 (2017), 484--501. https://doi.org/10.1016/j.jsc.2016.07.004
  15. Henry Cohn and Christopher Umans. 2003. A group-theoretic approach to fast matrix multiplication. In Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS). IEEE, 438--449.
  16. Don Coppersmith and Shmuel Winograd. 1990. Matrix multiplication via arithmetic progressions. Journal of Symbolic Computation 9, 3 (1990). https://doi.org/10.1016/S0747-7171(08)80013-2
  17. Alexander M. Davie and Andrew J. Stothers. 2013. Improved bound for complexity of matrix multiplication. Proceedings of the Royal Society of Edinburgh Section A: Mathematics 143, 2 (2013). https://doi.org/10.1017/S0308210511001648
  18. Hans F. De Groote. 1987. Lectures on the Complexity of Bilinear Problems. Vol. 245. Springer Science & Business Media.
  19. James Demmel, David Eliahu, Armando Fox, Shoaib Kamil, Benjamin Lipshitz, Oded Schwartz, and Omer Spillinger. 2013. Communication-optimal parallel recursive rectangular matrix multiplication. In Proceedings of the IEEE 27th International Parallel and Distributed Processing Symposium (IPDPS). https://doi.org/10.1109/IPDPS.2013.80
  20. Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Francisco J. R. Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, et al. 2022. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature 610, 7930 (2022), 47--53.
  21. Patrick C. Fischer. 1974. Further schemes for combining matrix algorithms. In International Colloquium on Automata, Languages, and Programming (ICALP). Springer, 428--436.
  22. Agner Fog. 2022. Instruction tables. Technical University of Denmark (2022). https://www.agner.org/optimize/instruction_tables.pdf
  23. François Le Gall. 2014. Powers of tensors and fast matrix multiplication. In Proceedings of the 39th International Symposium on Symbolic and Algebraic Computation (ISSAC). 296--303. https://doi.org/10.1145/2608628.2608664
  24. Marijn J. H. Heule, Manuel Kauers, and Martina Seidl. 2021. New ways to multiply 3 × 3-matrices. Journal of Symbolic Computation 104 (2021). https://doi.org/10.1016/j.jsc.2020.10.003
  25. John E. Hopcroft and Leslie R. Kerr. 1971. On minimizing the number of multiplications necessary for matrix multiplication. SIAM J. Appl. Math. 20, 1 (1971), 30--36.
  26. Joseph JáJá. 1980. On the Complexity of Bilinear Forms with Commutativity. SIAM J. Comput. 9, 4 (1980). https://doi.org/10.1137/0209056
  27. Larisa D. Jelfimova. 2019. A New Fast Recursive Matrix Multiplication Algorithm. Cybernetics and Systems Analysis 55, 4 (2019), 547--551. https://doi.org/10.1007/s10559-019-00163-2
  28. Larisa D. Jelfimova. 2021. A Fast Recursive Algorithm for Multiplying Matrices of Order n = 3^q (q > 1). Cybernetics and Systems Analysis 57, 2 (2021). https://doi.org/10.1007/s10559-021-00345-x
  29. Hong Jia-Wei and Hsiang-Tsung Kung. 1981. I/O complexity: The red-blue pebble game. In Proceedings of the 13th Annual ACM Symposium on Theory of Computing (STOC). 326--333.
  30. Norman P. Jouppi, Doe Hyun Yoon, Matthew Ashcraft, Mark Gottscho, Thomas B. Jablin, George Kurian, James Laudon, Sheng Li, Peter Ma, Xiaoyu Ma, Thomas Norrie, Nishant Patil, Sushma Prasad, Cliff Young, Zongwei Zhou, and David Patterson. 2021. Ten lessons from three generations shaped Google's TPUv4i: Industrial product. In Proceedings of the International Symposium on Computer Architecture (ISCA). https://doi.org/10.1109/ISCA52012.2021.00010
  31. Elaye Karstadt and Oded Schwartz. 2020. Matrix Multiplication, a Little Faster. Journal of the ACM 67, 1 (2020), 1--31.
  32. Grzegorz Kwasniewski, Marko Kabić, Maciej Besta, Joost VandeVondele, Raffaele Solcà, and Torsten Hoefler. 2019. Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC). 1--22.
  33. Julian D. Laderman. 1976. A noncommutative algorithm for multiplying 3 × 3 matrices using 23 multiplications. Bulletin of the American Mathematical Society 82 (1976), 126--128.
  34. Benjamin Lipshitz, Grey Ballard, James Demmel, and Oded Schwartz. 2012. Communication-avoiding parallel Strassen: Implementation and performance. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC). https://doi.org/10.1109/SC.2012.33
  35. Charles F. Van Loan. 2000. The ubiquitous Kronecker product. Journal of Computational and Applied Mathematics 123, 1--2 (2000), 85--100. https://doi.org/10.1016/S0377-0427(00)00393-9
  36. Roy Nissim and Oded Schwartz. 2019. Revisiting the I/O-complexity of fast matrix multiplication with recomputations. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 482--490.
  37. Victor Y. Pan. 1980. New Fast Algorithms for Matrix Operations. SIAM J. Comput. 9, 2 (1980). https://doi.org/10.1137/0209027
  38. Victor Y. Pan. 1982. Trilinear aggregating with implicit canceling for a new acceleration of matrix multiplication. Computers and Mathematics with Applications 8, 1 (1982). https://doi.org/10.1016/0898-1221(82)90037-2
  39. Robert L. Probert. 1976. On the additive complexity of matrix multiplication. SIAM J. Comput. 5, 2 (1976), 187--203.
  40. Andreas Rosowski. 2019. Fast Commutative Matrix Algorithm. arXiv preprint arXiv:1904.07683 (2019). http://arxiv.org/abs/1904.07683
  41. Arnold Schönhage. 1981. Partial and Total Matrix Multiplication. SIAM J. Comput. 10, 3 (1981). https://doi.org/10.1137/0210032
  42. Jacob Scott, Olga Holtz, and Oded Schwartz. 2015. Matrix multiplication I/O-complexity by path routing. In Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 35--45.
  43. Alexey V. Smirnov. 2013. The bilinear complexity and practical algorithms for matrix multiplication. Computational Mathematics and Mathematical Physics 53, 12 (2013). https://doi.org/10.1134/S0965542513120129
  44. Lorenzo De Stefani. 2019. The I/O Complexity of Hybrid Algorithms for Square Matrix Multiplication. In 30th International Symposium on Algorithms and Computation (ISAAC 2019) (LIPIcs, Vol. 149), Pinyan Lu and Guochuan Zhang (Eds.). Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 33:1--33:16. https://doi.org/10.4230/LIPIcs.ISAAC.2019.33
  45. Volker Strassen. 1969. Gaussian elimination is not optimal. Numerische Mathematik 13, 4 (1969). https://doi.org/10.1007/BF02165411
  46. Volker Strassen. 1973. Vermeidung von Divisionen. Journal für die Reine und Angewandte Mathematik 264 (1973). https://doi.org/10.1515/crll.1973.264.184
  47. Abraham Waksman. 1970. On Winograd's Algorithm for Inner Products. IEEE Trans. Comput. C-19, 4 (1970). https://doi.org/10.1109/T-C.1970.222926
  48. Virginia V. Williams. 2012. Multiplying matrices faster than Coppersmith-Winograd. In Proceedings of the 44th Annual ACM Symposium on Theory of Computing (STOC). 887--898. https://doi.org/10.1145/2213977.2214056
  49. Shmuel Winograd. 1968. A New Algorithm for Inner Product. IEEE Trans. Comput. C-17, 7 (1968), 693--694. https://doi.org/10.1109/TC.1968.227420
  50. Shmuel Winograd. 1968. On the number of multiplications necessary to compute certain functions. Communications on Pure and Applied Mathematics 23, 2 (1968). https://doi.org/10.1002/cpa.3160230204
  51. Shmuel Winograd. 1971. On multiplication of 2 × 2 matrices. Linear Algebra and Its Applications 4, 4 (1971), 381--388. https://doi.org/10.1016/0024-3795(71)90009-7
  52. Shmuel Winograd. 1976. Private communication with R. Probert [39].
  53. Cui Qing Yang and Barton P. Miller. 1988. Critical path analysis for the execution of parallel and distributed programs. In Proceedings of the International Conference on Distributed Computing Systems (ICDCS). https://doi.org/10.1109/dcs.1988.12538

Published in

SPAA '23: Proceedings of the 35th ACM Symposium on Parallelism in Algorithms and Architectures
June 2023, 504 pages
ISBN: 9781450395458
DOI: 10.1145/3558481

Copyright © 2023 Owner/Author

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall acceptance rate: 447 of 1,461 submissions, 31%
