Recursive array layouts and fast parallel matrix multiplication

Authors:
Siddhartha Chatterjee (Department of Computer Science, The University of North Carolina, Chapel Hill, NC)
Alvin R. Lebeck (Department of Computer Science, Duke University, Durham, NC)
Praveen K. Patnala (Department of Computer Science, The University of North Carolina, Chapel Hill, NC)
Mithuna Thottethodi (Department of Computer Science, Duke University, Durham, NC)

SPAA '99: Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures, June 1999, Pages 222–231. https://doi.org/10.1145/305619.305645

Published: 01 June 1999

References

1. I. Banicescu and S. F. Hummel. Balancing processor loads and exploiting data locality in N-body simulations. In Proceedings of Supercomputing'95 (CD-ROM), San Diego, CA, Dec. 1995. Available from http://www.supercomp.org/sc95/proceedings/594.BHUM/SC95.HTM.
2. T. Bially. Space-filling curves: Their generation and their application to bandwidth reduction. IEEE Transactions on Information Theory, IT-15(6):658-664, Nov. 1969.
3. J. Bilmes, K. Asanović, C.-W. Chin, and J. Demmel. Optimizing matrix multiply using PHiPAC: a Portable, High-Performance, ANSI C coding methodology. In Proceedings of International Conference on Supercomputing, pages 340-347, Vienna, Austria, July 1997.
4. R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207-216, Santa Barbara, CA, July 1995. Also see http://theory.lcs.mit.edu/~cilk.
5. S. Carr, K. S. McKinley, and C.-W. Tseng. Compiler optimizations for improving data locality. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 252-262, San Jose, CA, Oct. 1994.
6. L. Carter, J. Ferrante, and S. F. Hummel. Hierarchical tiling for improved superscalar performance. In International Parallel Processing Symposium, Apr. 1995.
7. S. Chatterjee, J. R. Gilbert, R. Schreiber, and S.-H. Teng. Optimal evaluation of array expressions on massively parallel machines. ACM Trans. Prog. Lang. Syst., 17(1):123-156, Jan. 1995.
8. M. Cierniak and W. Li. Unifying data and control transformations for distributed shared-memory machines. In Proceedings of the ACM SIGPLAN'95 Conference on Programming Language Design and Implementation, pages 205-217, La Jolla, CA, June 1995.
9. D. Culler and J. P. Singh. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann, 1998.
10. J. J. Dongarra, J. Du Croz, I. S. Duff, and S. Hammarling. A set of Level 3 Basic Linear Algebra Subprograms. ACM Trans. Math. Softw., 16(1):1-17, Jan. 1990.
11. P. C. Fischer and R. L. Probert. Efficient procedures for using matrix algorithms. In Automata, Languages and Programming, number 14 in Lecture Notes in Computer Science, pages 413-427. Springer-Verlag, 1974.
12. J. D. Frens and D. S. Wise. Auto-blocking matrix-multiplication or tracking BLAS3 performance with source code. In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 206-216, Las Vegas, NV, June 1997.
13. M. Frigo and S. G. Johnson. FFTW: An adaptive software architecture for the FFT. In Proceedings of ICASSP'98, volume 3, page 1381, Seattle, WA, 1998. IEEE.
14. M. F. Goodchild and A. W. Grandfield. Optimizing raster storage: an examination of four alternatives. In Proceedings of Auto-Carto 6, volume 1, pages 400-407, Ottawa, Oct. 1983.
15. M. Gupta. Automatic Data Partitioning on Distributed Memory Multicomputers. PhD thesis, University of Illinois at Urbana-Champaign, Urbana, IL, Sept. 1992. Available as technical reports UILU-ENG-92-2237 and CRHC-92-19.
16. F. G. Gustavson. Recursion leads to automatic variable blocking for dense linear algebra algorithms. IBM Journal of Research and Development, 41(6):737-755, Nov. 1997.
17. N. J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, Philadelphia, 1996.
18. D. Hilbert. Über stetige Abbildung einer Linie auf ein Flächenstück. Mathematische Annalen, 38:459-460, 1891.
19. M. D. Hill and A. J. Smith. Evaluating associativity in CPU caches. IEEE Trans. Comput., C-38(12):1612-1630, Dec. 1989.
20. Y. C. Hu, S. L. Johnsson, and S.-H. Teng. High Performance Fortran for highly irregular problems. In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 13-24, Las Vegas, NV, June 1997.
21. S. F. Hummel, I. Banicescu, C.-T. Wang, and J. Wein. Load balancing and data locality via fractiling: An experimental study. In Language, Compilers and Run-Time Systems for Scalable Computers. Kluwer Academic Publishers, 1995.
22. H. V. Jagadish. Linear clustering of objects with multiple attributes. In H. Garcia-Molina and H. V. Jagadish, editors, Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, pages 332-342, Atlantic City, NJ, May 1990. ACM, ACM Press. Published as SIGMOD RECORD 19(2), June 1990.
23. K. Kennedy and U. Kremer. Automatic data layout for distributed memory machines. ACM Trans. Prog. Lang. Syst., 1998. To appear.
24. K. Knobe, J. D. Lukas, and G. L. Steele Jr. Data optimization: Allocation of arrays to reduce communication on SIMD machines. Journal of Parallel and Distributed Computing, 8(2):102-118, Feb. 1990.
25. C. H. Koelbel, D. B. Loveman, R. S. Schreiber, G. L. Steele Jr., and M. E. Zosel. The High Performance Fortran Handbook. Scientific and Engineering Computation. The MIT Press, Cambridge, MA, 1994.
26. M. S. Lam, E. E. Rothberg, and M. E. Wolf. The cache performance and optimizations of blocked algorithms. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 63-74, Apr. 1991.
27. R. Laurini. Graphical data bases built on Peano space-filling curves. In C. E. Vandoni, editor, Proceedings of the EUROGRAPHICS'85 Conference, pages 327-338, Amsterdam, 1985. North-Holland.
28. C. E. Leiserson. Personal communication, Aug. 1998.
29. M. E. Mace. Memory Storage Patterns in Parallel Processing. Kluwer international series in engineering and computer science. Kluwer Academic Press, Norwell, MA, 1987.
30. M. Mano. Digital Design. Prentice-Hall, 1984.
31. B. Moon, H. V. Jagadish, C. Faloutsos, and J. H. Saltz. Analysis of the clustering properties of Hilbert space-filling curve. Technical Report CS-TR-3611, Computer Science Department, University of Maryland, College Park, MD, 1996.
32. G. Peano. Sur une courbe qui remplit toute une aire plane. Mathematische Annalen, 36:157-160, 1890.
33. J. R. Pilkington and S. B. Baden. Dynamic partitioning of non-uniform structured workloads with spacefilling curves. IEEE Transactions on Parallel and Distributed Systems, 7(3):288-300, Mar. 1996.
34. H. Sagan. Space-Filling Curves. Springer-Verlag, 1994. ISBN 0-387-94265-3.
35. J. P. Singh, T. Joe, J. L. Hennessy, and A. Gupta. An empirical comparison of the Kendall Square Research KSR-1 and the Stanford DASH multiprocessors. In Proceedings of Supercomputing '93, pages 214-225, Portland, OR, Nov. 1993.
36. V. Strassen. Gaussian elimination is not optimal. Numer. Math., 13:354-356, 1969.
37. M. Thottethodi, S. Chatterjee, and A. R. Lebeck. Tuning Strassen's matrix multiplication for memory efficiency. In Proceedings of SC98 (CD-ROM), Orlando, FL, Nov. 1998. Available from http://www.supercomp.org/sc98.
38. M. S. Warren and J. K. Salmon. A parallel hashed Oct-Tree N-body algorithm. In Proceedings of Supercomputing'93, pages 12-21, Portland, OR, Nov. 1993.
40. M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN'91 Conference on Programming Language Design and Implementation, pages 30-44, Toronto, Canada, June 1991.
41. M. J. Wolfe. More iteration space tiling. In Proceedings of Supercomputing'89, pages 655-664, Reno, NV, Nov. 1989.

Published in
SPAA '99: Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
June 1999
261 pages
ISBN: 1581131240
DOI: 10.1145/305619
Chairmen: Gary Miller (Carnegie Mellon Univ.), Vijaya Ramachandran (Univ. of Texas, Austin)
Copyright © 1999 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
SPAA '99 Paper Acceptance Rate: 26 of 90 submissions, 29%
Overall Acceptance Rate: 447 of 1,461 submissions, 31%
Article Metrics
- Total Citations: 66
- Total Downloads: 1,282
- Downloads (Last 12 months): 160
- Downloads (Last 6 weeks): 21